As a followup to yesterday’s post on a floating point problem I encountered using the arctangent, I want to discuss how floating point numbers are stored in computers. Apologies in advance to my misrepresenting a lot of this (and corrections are, as always, welcome), but it is helpful for me to work it out. If you want to just get straight to experimenting, I am writing a Python script which will be posted Wednesday that converts decimals to bits, and for mac users, a quick guide to getting started with Python will appear tomorrow.
We represent each (double precision) number x as (to use the lingo)
1. The first bit (#63 above) records the sign of the number: 0 for positive, 1 for negative.
2. Bits 63-52 (11 bits) record an exponent E, the “size” of the number.
3. Bits 52-0 (52 bits) record the “mantissa” M of the number itself. We always choose the mantissa to be as large as possible, which means the leading digit will always be 1, so we will actually squeeze out 53 bits of information by leaving out the leading 1.
Also notice that with 64 bits, we can hope to distinguish at most unique numbers.As an example, we might store the number 1 as follows:
1. The sign is positive, so the first digit is 0.
2. E should be chosen so that the mantissa lies between and . Some inspection shows that E = 970 is the only choice, so the exponent is . Note that in binary, 970 = 512+256+128+64+8+1 = 01111001001.
3. M should then be , stored in binary, with the leading 1 omitted. This is just a string of 52 zeros.
4. So the 64 bits to store 1 is 0 01111001001 0000000000000000000000000000000000000000000000000000.
Now the problem: we cannot store most decimals exactly. Let’s write the number 1.1. Notice that if you type 1.1 into a Python prompt and hit return, 1.1000000000000001 is displayed, which is already a problem. But anyways:
1. The sign is positive, so the first digit is 1.
2. Again we choose E so the mantissa is in the correct range. Again E will be 970.
3. Now we want to choose M so that is as close to 1.1 as possible, so we want to round the number to the nearest integer, and then encode that integer as a binary number. This is easy to do with the Python command
v,r = divmod(11*(2**51),10),
noting that the remainder is greater than 5 (namely, 6), so we should round up v = 9907919180215090 to v+1. Then
bin(v+1) outputs M as a binary number. In particular, we find that
1.1 “=” 0 01111001001 0001100110011001100110011001100110011001100110011001
1. The approximation is off by at most as compared to the leading digit. So the approximation to 1.1 was this close, whereas 4.1 might be wrong by , since the leading digit is about 4 times as big.
2. We can represent all integers between 0 and precisely. After that, there is some error.