Floating Point Number Representation


Floating point arithmetic catastrophes occur in places other than carefully contrived pathological examples. In LU factorization the numerical computations can go drastically wrong when only a small pivot is available, since the subdiagonal entries in its column are divided by it. What counts as "small"? When should we worry about it? How can the problem be fixed if it occurs?
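
As a concrete illustration (a minimal Java sketch, not taken from any particular LU code), eliminating without pivoting on the 2x2 system with rows [1e-20, 1 | 1] and [1, 1 | 2], whose true solution is approximately x1 = x2 = 1, already goes wrong in double precision:

    public class SmallPivot {
        public static void main(String[] args) {
            // System: [1e-20  1] [x1]   [1]
            //         [1      1] [x2] = [2],   true solution is roughly x1 = x2 = 1
            double a11 = 1e-20, a12 = 1.0, b1 = 1.0;
            double a21 = 1.0,   a22 = 1.0, b2 = 2.0;

            double m = a21 / a11;               // huge multiplier: subdiagonal entry / tiny pivot
            a22 = a22 - m * a12;                // 1 - 1e20 rounds to -1e20; the 1 is lost
            b2  = b2  - m * b1;                 // 2 - 1e20 also rounds to -1e20

            double x2 = b2 / a22;               // back substitution gives x2 = 1
            double x1 = (b1 - a12 * x2) / a11;  // (1 - 1) / 1e-20 = 0, drastically wrong
            System.out.println("x1 = " + x1 + ", x2 = " + x2);
        }
    }

Swapping the two rows before eliminating (partial pivoting) makes the multiplier tiny instead of huge and recovers both components correctly.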

Studying floating point operations requires knowing the machine representation of floats. The IEEE 754 floating point standard (Kahan, circa 1976-1989) is now followed by virtually all workstation manufacturers. It is also followed on Intel chips, the usual PC processors.

The full story of how the standard arose is still worth reading as an example of how (good) standards evolve and become established. A major motivation was Intel's plans for the x86 family of chips in the mid-1970s. At that time (and through the 1990s on some platforms) different machines could return greatly differing results in floating point arithmetic. For valid machine representable numbers x there were cases where, for example, the computed value of (x+x) - x did not come back as x.

Those (and a whole zoo of other vile creations) are the reason you will see lines like

    x = (x+x)-x

in some codes.

The IEEE standard has two parts: representation and operations. Many vendors say they are IEEE compliant when they only follow the representation standard. This page covers the single precision floating point standard, which uses 4 bytes for a floating point number. You should work through the following for double precision (8-byte words) as well, and not just for practice: all of our FP computations will be in double precision, and you need to be able to spot any famously bogus numbers that show up.

The single precision format uses 32 bits (4 bytes): 1 sign bit s, 8 exponent bits e, and 23 fraction bits f. The stored exponent carries a bias of 127, so the actual exponent is p = e - 127, and a normalized number has the value (-1)^s x 1.f x 2^(e - 127), where the leading 1 bit of the mantissa is implicit and not stored.

So it looks like:
           |s|---e----|----------f------------|
            - -------- -----------------------|
            0 1      8 9                     31
where the numbers in the bottom row give the bit positions, starting from 0. Consider the following example IEEE floating point number, where the bits have been clustered for visual ease:
        0 0000 1110 1010 0000 0000 0000 0000 000
The sign bit is 0, the stored exponent field is 0000 1110 = 14 so the actual exponent is 14 - 127 = -113, and the fraction is .101 = 0.625. So the number is 1.625 x 2^(-113) ~ 1.5648 x 10^(-34).
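
As a cross-check (a small Java sketch, not part of the notes; the bit pattern above corresponds to the 32-bit integer 0x07500000), one can decode the fields by hand and compare against the value Java itself recovers via Float.intBitsToFloat:

    public class DecodeFloat {
        public static void main(String[] args) {
            int bits = 0x07500000;              // 0 | 0000 1110 | 1010 0000 0000 0000 0000 000
            int s = (bits >>> 31) & 0x1;        // sign bit
            int e = (bits >>> 23) & 0xFF;       // stored exponent field, here 14
            int f = bits & 0x7FFFFF;            // 23 fraction bits, here 101 followed by zeros

            double mantissa = 1.0 + f / (double) (1 << 23);   // implicit leading 1, gives 1.625
            double value = (s == 0 ? 1 : -1) * mantissa * Math.pow(2.0, e - 127);

            System.out.println(value);                        // about 1.5648e-34
            System.out.println(Float.intBitsToFloat(bits));   // the same number, decoded by Java
        }
    }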

Special Values

Special values are useful, but they can be efficiency problems. The values +0, -0, +inf, -inf, and NaN are represented using the reserved exponent fields e = 0 and e = 255. NaN is useful, but has some tricky features. First of all, NaN is not equal to NaN. Any numerical operation that involves a NaN will end up giving you NaN (so, in particular, NaN - NaN is not zero). While +inf is equal to +inf, +inf - inf is not zero either; it is NaN. In Java, for example, you must use the test
if (Double.isNaN(d))
to check if the number d is in fact NaN.
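
A short Java sketch (illustrative only) confirms the behavior described above:

    public class NanDemo {
        public static void main(String[] args) {
            double nan = 0.0 / 0.0;                  // one way to generate a NaN
            double inf = 1.0 / 0.0;                  // +inf

            System.out.println(nan == nan);          // false: NaN is not equal to NaN
            System.out.println(nan - nan);           // NaN, not zero
            System.out.println(inf == inf);          // true: +inf is equal to +inf
            System.out.println(inf - inf);           // NaN, not zero
            System.out.println(Double.isNaN(nan));   // true: the correct way to test for NaN
        }
    }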

Another set of special values, the denormalized numbers, helps extend the range of representation. Note that the special values above are triggered by either e = 0 or e = 255. However, the case where e = 0 but f != 0 has not been covered yet. That combination indicates a denormalized (or subnormal) number: one with an actual exponent of p = -126 and a fractional part without the implicit leading 1 bit.

Range of floats

Since the stored exponent's range for standard (normalized) floating point numbers is 0 < e < 255, the real exponent satisfies -127 < p < 128. Note that we have to avoid the values 0 and 255 because they signal special values that are not part of the normalized number spread. The smallest single precision positive normalized number is thus
1.0000 0000 0000 0000 0000 000 x 2^(-126) = 2^(-126) ~ 1.2 x 10^(-38)

The largest single precision positive number is 1.1111 1111 1111 1111 1111 111 x 2^(127) = (2 - 2^(-23)) x 2^(127) ~ 3.4 x 10^(38)

For numbers smaller than 2^(-126), the IEEE standard uses denormalized numbers. This makes the smallest representable positive number
2^(-23) x 2^(-126) = 2^(-149)
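
These limits are easy to check (a small Java sketch, not part of the notes; Java exposes them as constants on the Float class):

    public class FloatRange {
        public static void main(String[] args) {
            System.out.println(Float.MIN_NORMAL);   // 2^(-126) ~ 1.17549435e-38, smallest normalized float
            System.out.println(Float.MAX_VALUE);    // (2 - 2^(-23)) x 2^(127) ~ 3.4028235e38, largest float
            System.out.println(Float.MIN_VALUE);    // 2^(-149) ~ 1.4e-45, smallest denormalized float
        }
    }
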
Denormalized numbers typically occur with gradual underflow, for example when you repeatedly divide a number by a number greater than one. However, they are usually catastrophic in terms of efficiency. Computer manufacturers put lots of money into getting fast floating point performance, and they do so by specializing for the "usual" case. Arithmetic with denormalized numbers is not the "usual" case, so typically those operations are implemented in software, not hardware. That in turn means a computational rate around 10^(3) times slower than for other numbers.
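
For instance, repeatedly halving a single precision number passes through the denormalized range before finally underflowing to zero (a minimal Java sketch of gradual underflow):

    public class GradualUnderflow {
        public static void main(String[] args) {
            float x = 1.0f;
            int halvings = 0;
            while (x >= Float.MIN_NORMAL) {      // keep halving while x is still normalized
                x /= 2.0f;
                halvings++;
            }
            System.out.println("first denormalized value: " + x + " after " + halvings + " halvings");
            while (x > 0.0f) {                   // denormals keep shrinking gradually toward zero
                x /= 2.0f;
                halvings++;
            }
            System.out.println("underflowed to zero after " + halvings + " halvings");
        }
    }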

Denormalized number capability can also slow down operations with zeros. A denormalized number is signaled by an exponent field of zero, just as ±0 is. If a machine checks the exponent field to see whether the number might be denormalized, then it must next check the mantissa; only if the mantissa is zero can the number be fed directly to the floating point arithmetic unit. This checking can add a 30% time penalty to operations with zeros.


Next page: Other Floating Point Notes

Go back home