# Floating point, continued

As a followup to yesterday’s post on a floating point problem I encountered using the arctangent, I want to discuss how floating point numbers are stored in computers.  Apologies in advance to my misrepresenting a lot of this (and corrections are, as always, welcome), but it is helpful for me to work it out.  If you want to just get straight to experimenting, I am writing a Python script which will be posted Wednesday that converts decimals to bits, and for mac users, a quick guide to getting started with Python will appear tomorrow.

Each of these bits is either 0 or 1, and goes towards storing a "double". From wikipedia.

We represent each (double precision) number x as (to use the lingo)

$x \approx sign*mantissa*2^{exponent-1023}$,

where the mantissa and exponent are each an integer.  We will deal with doubles, which means we get 64 bits, each either 0 or 1, to work with (see the nice figure above).  Then

1. The first bit (#63 above) records the sign of the number: 0 for positive, 1 for negative.
2. Bits 63-52 (11 bits) record an exponent E, the “size” of the number.
3. Bits 52-0 (52 bits) record the “mantissa” M of the number itself.  We always choose the mantissa to be as large as possible, which means the leading digit will always be 1, so we will actually squeeze out 53 bits of information by leaving out the leading 1.

Also notice that with 64 bits, we can hope to distinguish at most $2^{64}$ unique numbers.

Some nice hills in Montana.

As an example, we might store the number 1 as follows:
1. The sign is positive, so the first digit is 0.
2. E should be chosen so that the mantissa lies between $2^{53}$ and $2^{54}-1$.  Some inspection shows that E = 970 is the only choice, so the exponent is $2^{970-1023}= 2^{53}$. Note that in binary, 970 = 512+256+128+64+8+1 = 01111001001.
3. M should then be  $2^{53}$, stored in binary, with the leading 1 omitted.  This is just a string of 52 zeros.
4. So the 64 bits to store 1 is 0 01111001001 0000000000000000000000000000000000000000000000000000.

Now the problem: we cannot store most decimals exactly.  Let’s write the number 1.1.  Notice that if you type 1.1 into a Python prompt and hit return, 1.1000000000000001 is displayed, which is already a problem.  But anyways:

1. The sign is positive, so the first digit is 1.
2. Again we choose so the mantissa is in the correct range.  Again will be 970.
3. Now we want to choose M so that $M\cdot 2^{-51}$ is as close to 1.1 as possible, so we want to round the number $\frac{11}{10}*2^{51}$ to the nearest integer, and then encode that integer as a binary number.  This is easy to do with the Python command
 v,r = divmod(11*(2**51),10),
noting that the remainder is greater than 5 (namely, 6), so we should round up v = 9907919180215090 to v+1. Then
bin(v+1) outputs as a binary number.  In particular, we find that
1.1 “=” 0 01111001001 0001100110011001100110011001100110011001100110011001

1. The approximation is off by at most $2^{-53}$ as compared to the leading digit.  So the approximation to 1.1 was this close, whereas 4.1 might be wrong by $2^{-51}$, since the leading digit is about 4 times as big.
2. We can represent all integers between 0 and $2^{53}-1$ precisely.  After that, there is some error.