Other IEEE Floating Point Notes

Suppose x = m x 2^p is a machine representable number in single precision (with 1 <= m < 2). Then the next larger representable number differs from x by 2^(p-23), so the relative spacing is about 2^-23, roughly 1.2 x 10^-7. This number is the precision of the representation, and is referred to as "machine epsilon". But it plays the role users typically think of only for values near 1.0. For example, let x = flt(10^32) be the closest machine representable number to 10^32. What is the smallest machine number epsilon such that flt(10^32) + epsilon is not equal to flt(10^32)? Hint: it's a lot bigger than 1.2 x 10^-7.
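
You can check your answer to the exercise directly. A minimal C sketch (the value 1.0e32f and the printed quantities are just for illustration) uses the C99 function nextafterf to measure the gap between flt(10^32) and the next representable single precision number; the absolute gap is enormous, while the relative gap is still near machine epsilon.

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      float x   = 1.0e32f;                      /* flt(10^32) */
      float gap = nextafterf(x, INFINITY) - x;  /* distance to the next float */

      printf("x       = %.8e\n", x);
      printf("gap     = %.8e\n", gap);          /* around 1e25, not 1.2e-7 */
      printf("gap / x = %.8e\n", gap / x);      /* back near machine epsilon */
      return 0;
  }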

Extended Precision

The IEEE 754 standard also specifies extended versions of single and double precision, although the range of exponents for each simply has a lower bound rather than an exact specification. For extended double precision, the mantissa must have at least 64 bits, and there must be enough bits in the biased exponent to represent at least the range from 2^-16382 to 2^16383. This means at least 79 bits for the full format.
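
In C, the extended format usually appears as long double; on x86 hardware this is typically the 80-bit x87 format (64-bit mantissa, 15-bit exponent). A small sketch, assuming a C99 compiler, to see what your machine provides:

  #include <stdio.h>
  #include <float.h>

  int main(void)
  {
      printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG); /* mantissa bits: 64 for x87 extended */
      printf("LDBL_MIN_EXP  = %d\n", LDBL_MIN_EXP);  /* typically -16381 */
      printf("LDBL_MAX_EXP  = %d\n", LDBL_MAX_EXP);  /* typically 16384; C counts so that
                                                        base^(MAX_EXP-1) is representable */
      printf("storage size  = %zu bytes\n", sizeof(long double));
      return 0;
  }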

IEEE Floating Point Rounding

In addition to the representation, the IEEE FP standard includes operations, one of which is rounding. The default is "round to nearest", with ties broken by rounding to even. The other modes are round toward +infinity, round toward -infinity, and round toward zero. There is a special name for the last mode ... what is it? Other operations are also specified, but they won't be too important at the level of our work. However, if you are working in a field that uses special functions frequently (Bessel functions, exponentials, trig functions) and/or you need to implement those, a careful study of the IEEE FP operations can be useful.
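
Where the language exposes the rounding mode, you can switch modes and watch a result change. A minimal C99 sketch using <fenv.h>; not every compiler honors the FENV_ACCESS pragma or avoids folding the division at compile time, so treat this as illustrative:

  #include <stdio.h>
  #include <fenv.h>

  #pragma STDC FENV_ACCESS ON    /* we intend to change the FP environment */

  int main(void)
  {
      volatile double one = 1.0, three = 3.0;  /* volatile discourages constant folding */

      fesetround(FE_TONEAREST);  printf("to nearest : %.17e\n", one / three);
      fesetround(FE_DOWNWARD);   printf("toward -inf: %.17e\n", one / three);
      fesetround(FE_UPWARD);     printf("toward +inf: %.17e\n", one / three);
      fesetround(FE_TOWARDZERO); printf("toward 0   : %.17e\n", one / three);

      fesetround(FE_TONEAREST);  /* restore the default */
      return 0;
  }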

Finding Floating Point Properties

The IEEE standard has removed much of the fun from finding the properties of your machine's floating point representation. Nevertheless it is useful to have a package which can find things like machine precision for you - this helps you create portable code. We will examine this in Matlab, but the LAPACK project has produced a function called dlamch which returns information about the floating point system of the machine you are running on. The values include:

  eps   = relative machine precision
  sfmin = smallest number for which 1/sfmin does not overflow
  base  = base of the machine
  prec  = eps*base
  t     = number of (base) digits in the mantissa
  emin  = min exponent before gradual underflow
  rmin  = underflow threshold: base^(emin-1)
  emax  = maximum exponent before overflow
  rmax  = overflow threshold: base^emax * (1-eps)
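
If LAPACK is not handy, essentially the same quantities are available from C's <float.h>, with slightly different conventions; in particular, dlamch's eps is the unit roundoff, which is half of C's DBL_EPSILON when rounding to nearest is in effect. A sketch for double precision:

  #include <stdio.h>
  #include <float.h>

  int main(void)
  {
      printf("eps   (DBL_EPSILON)  = %e\n", DBL_EPSILON); /* base^(1-t), spacing at 1.0 */
      printf("base  (FLT_RADIX)    = %d\n", FLT_RADIX);
      printf("t     (DBL_MANT_DIG) = %d\n", DBL_MANT_DIG);
      printf("emin  (DBL_MIN_EXP)  = %d\n", DBL_MIN_EXP);
      printf("rmin  (DBL_MIN)      = %e\n", DBL_MIN);      /* base^(emin-1) */
      printf("emax  (DBL_MAX_EXP)  = %d\n", DBL_MAX_EXP);
      printf("rmax  (DBL_MAX)      = %e\n", DBL_MAX);      /* base^emax * (1-eps) */
      return 0;
  }
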
For a real workout of your machine's number system, Kahan and others have developed the program called PARANOIA. It tests every aspect of your floating point hardware, and was a good prod for the development of the IEEE standard.

Digits of Accuracy

The double precision FP format is required to provide 15-17 significant decimal digits of accuracy. What does the range mean? This is one requirement that many vendors fail on: it requires the I/O libraries and binary-decimal conversion routines to perform to a certain standard, and unfortunately few vendors hold to it. A critical point comes out of this: if you are writing out a double precision number and want to retain the full accuracy, in C you need to use a print format descriptor such as %21.17e.
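
A quick way to test your own environment is to write a double out with the format above and read it back: if the conversion routines meet the standard, the value is recovered bit for bit, while a short format like %e loses accuracy. A sketch:

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      double x = 1.0 / 3.0;
      char   buf[64];
      double y, z;

      sprintf(buf, "%21.17e", x);   /* full accuracy: 18 significant digits */
      y = strtod(buf, NULL);

      sprintf(buf, "%e", x);        /* default precision: only 7 significant digits */
      z = strtod(buf, NULL);

      printf("%%21.17e round trip: %s\n", (x == y) ? "exact" : "NOT exact");
      printf("%%e      round trip: %s\n", (x == z) ? "exact" : "NOT exact");
      return 0;
  }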

Languages and IEEE Floating Point

It has long been a sore point with numerical analysts that while the IEEE standard is typically implemented in hardware, there is little or no high-level language access to features such as setting the rounding mode from within a program. Some particular features:

Potpourri

