Floating Point Number Representation


Floating point arithmetic catastrophes occur in places other than carefully contrived pathological examples. In LU factorization the numerical computations can go drastically wrong when only a small pivot is available, since the subdiagonal entries in its column are divided by it. What counts as "small"? When should we worry about it? How can the problem be fixed if it occurs?
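
As a concrete illustration (a minimal Java sketch, not taken from any particular LU code), eliminating without pivoting on the 2x2 system with rows [1e-20, 1 | 1] and [1, 1 | 2], whose true solution is approximately x1 = x2 = 1, already goes wrong in double precision:

    public class SmallPivot {
        public static void main(String[] args) {
            // System: [1e-20  1] [x1]   [1]
            //         [1      1] [x2] = [2],   true solution is roughly x1 = x2 = 1
            double a11 = 1e-20, a12 = 1.0, b1 = 1.0;
            double a21 = 1.0,   a22 = 1.0, b2 = 2.0;

            double m = a21 / a11;               // huge multiplier: subdiagonal entry / tiny pivot
            a22 = a22 - m * a12;                // 1 - 1e20 rounds to -1e20; the 1 is lost
            b2  = b2  - m * b1;                 // 2 - 1e20 also rounds to -1e20

            double x2 = b2 / a22;               // back substitution gives x2 = 1
            double x1 = (b1 - a12 * x2) / a11;  // (1 - 1) / 1e-20 = 0, drastically wrong
            System.out.println("x1 = " + x1 + ", x2 = " + x2);
        }
    }

Swapping the two rows before eliminating (partial pivoting) makes the multiplier tiny instead of huge and recovers both components correctly.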

Studying floating point operations requires knowing the machine representation of floats. The IEEE 754 floating point standard (Kahan, circa 1976-1989) is now followed by virtually all workstation manufacturers. It is also followed on Intel chips, the usual PC processors.

The full story of how the standard arose is still worth reading as an example of how (good) standards evolve and become established. A major motivation was Intel's plans for the x86 family of chips in the mid-1970s. At that time (and through the 1990s on some platforms) different machines could return greatly differing results in floating point arithmetic. For valid machine representable numbers x there were cases where, for example, the computed value of (x+x) - x did not come back as x.

Those (and a whole zoo of other vile creations) are the reason you will see lines like

    x = (x+x)-x

in some codes.

The IEEE standard has two parts: representation and operations. Many vendors say they are IEEE compliant when they only follow the representation standard. This page covers the single precision floating point standard, which uses 4 bytes for a floating point number. You should work through the following for double precision (8-byte words) as well, and not just for practice: all of our FP computations will be in double precision, and you need to be able to spot any famously bogus numbers that show up.

The single precision format uses 32 bits (4 bytes): 1 sign bit s, 8 exponent bits e, and 23 fraction bits f. The stored exponent carries a bias of 127, so the actual exponent is p = e - 127, and a normalized number has the value (-1)^s x 1.f x 2^(e - 127), where the leading 1 bit of the mantissa is implicit and not stored.

So it looks like:
           |s|---e----|----------f------------|
            - -------- -----------------------|
            0 1      8 9                     31
where the numbers in the bottom row give the bit positions, starting from 0. Consider the following example IEEE floating point number, where the bits have been clustered for visual ease:
        0 0000 1110 1010 0000 0000 0000 0000 000
The sign bit is 0, the stored exponent field is 0000 1110 = 14 so the actual exponent is 14 - 127 = -113, and the fraction is .101 = 0.625. So the number is 1.625 x 2^(-113) ~ 1.5648 x 10^(-34).
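
As a cross-check (a small Java sketch, not part of the notes; the bit pattern above corresponds to the 32-bit integer 0x07500000), one can decode the fields by hand and compare against the value Java itself recovers via Float.intBitsToFloat:

    public class DecodeFloat {
        public static void main(String[] args) {
            int bits = 0x07500000;              // 0 | 0000 1110 | 1010 0000 0000 0000 0000 000
            int s = (bits >>> 31) & 0x1;        // sign bit
            int e = (bits >>> 23) & 0xFF;       // stored exponent field, here 14
            int f = bits & 0x7FFFFF;            // 23 fraction bits, here 101 followed by zeros

            double mantissa = 1.0 + f / (double) (1 << 23);   // implicit leading 1, gives 1.625
            double value = (s == 0 ? 1 : -1) * mantissa * Math.pow(2.0, e - 127);

            System.out.println(value);                        // about 1.5648e-34
            System.out.println(Float.intBitsToFloat(bits));   // the same number, decoded by Java
        }
    }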

Special Values

Special values are useful, but they can be efficiency problems. The values +0, -0, +inf, -inf, and NaN are represented using the reserved exponent fields e = 0 and e = 255. NaN is useful, but has some tricky features. First of all, NaN is not equal to NaN. Any numerical operation that involves a NaN will end up giving you NaN (so, in particular, NaN - NaN is not zero). While +inf is equal to +inf, +inf - inf is not zero either; it is NaN. In Java, for example, you must use the test
if (Double.isNaN(d))
to check if the number d is in fact NaN.
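
A short Java sketch (illustrative only) confirms the behavior described above:

    public class NanDemo {
        public static void main(String[] args) {
            double nan = 0.0 / 0.0;                  // one way to generate a NaN
            double inf = 1.0 / 0.0;                  // +inf

            System.out.println(nan == nan);          // false: NaN is not equal to NaN
            System.out.println(nan - nan);           // NaN, not zero
            System.out.println(inf == inf);          // true: +inf is equal to +inf
            System.out.println(inf - inf);           // NaN, not zero
            System.out.println(Double.isNaN(nan));   // true: the correct way to test for NaN
        }
    }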

Another set of special values, the denormalized numbers, helps extend the range of representation. Note that the special values above are triggered by either e = 0 or e = 255. However, the case where e = 0 but f != 0 has not been covered yet. That combination indicates a denormalized (or subnormal) number: one with an actual exponent of p = -126 and a fractional part without the implicit leading 1 bit.

Range of floats

Since the stored exponent's range for standard (normalized) floating point numbers is 0 < e < 255, the real exponent satisfies -127 < p < 128. Note that we have to avoid the values 0 and 255 because they signal special values that are not part of the normalized number spread. The smallest single precision positive normalized number is thus
1.0000 0000 0000 0000 0000 000 x 2^(-126) = 2^(-126) ~ 1.2 x 10^(-38)

The largest single precision positive number is 1.1111 1111 1111 1111 1111 111 x 2^(127) = (2 - 2^(-23)) x 2^(127) ~ 3.4 x 10^(38)

For numbers smaller than 2^(-126), the IEEE standard uses denormalized numbers. This makes the smallest representable positive number
2^(-23) x 2^(-126) = 2^(-149)
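
These limits are easy to check (a small Java sketch, not part of the notes; Java exposes them as constants on the Float class):

    public class FloatRange {
        public static void main(String[] args) {
            System.out.println(Float.MIN_NORMAL);   // 2^(-126) ~ 1.17549435e-38, smallest normalized float
            System.out.println(Float.MAX_VALUE);    // (2 - 2^(-23)) x 2^(127) ~ 3.4028235e38, largest float
            System.out.println(Float.MIN_VALUE);    // 2^(-149) ~ 1.4e-45, smallest denormalized float
        }
    }
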
Denormalized numbers typically occur with gradual underflow, for example when you repeatedly divide a number by a number greater than one. However, they are usually catastrophic in terms of efficiency. Computer manufacturers put lots of money into getting fast floating point performance, and they do so by specializing for the "usual" case. Arithmetic with denormalized numbers is not the "usual" case, so typically those operations are implemented in software, not hardware. That in turn means a computational rate around 10^(3) times slower than for other numbers.
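
For instance, repeatedly halving a single precision number passes through the denormalized range before finally underflowing to zero (a minimal Java sketch of gradual underflow):

    public class GradualUnderflow {
        public static void main(String[] args) {
            float x = 1.0f;
            int halvings = 0;
            while (x >= Float.MIN_NORMAL) {      // keep halving while x is still normalized
                x /= 2.0f;
                halvings++;
            }
            System.out.println("first denormalized value: " + x + " after " + halvings + " halvings");
            while (x > 0.0f) {                   // denormals keep shrinking gradually toward zero
                x /= 2.0f;
                halvings++;
            }
            System.out.println("underflowed to zero after " + halvings + " halvings");
        }
    }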

Denormalized number capability can also slow down operations with zeros. A denormalized number is signaled by an exponent field of zero, just as ±0 is. If a machine checks the exponent field to see whether the number might be denormalized, then it must next check the mantissa; only if the mantissa is zero can the number be fed directly to the floating point arithmetic unit. This checking can add a 30% time penalty to operations with zeros.


Next page: Other Floating Point Notes

Go back home