Other IEEE Floating Point Notes
Suppose x = m * 2^p is
a machine representable number in single precision. Then
- The next larger machine number is
x+ = (m + 2^-23) * 2^p.
- The next smaller machine number is
x- = (m - 2^-23) * 2^p.
- These have spacing of
x+ - x = x - x- = 2^-23 * 2^p = 2^(p-23).
- and a relative spacing of about
(x+ - x)/x = 2^-23 ~ 1.2 * 10^-7
This number is the precision of the representation, and is
referred to as "machine epsilon". But it behaves the way users
typically expect only for values near 1.0.
For example,
let x = flt(10^32) be the closest machine representable number
to 10^32.
What is the smallest machine number epsilon such that
flt(10^32) + epsilon is not equal to flt(10^32)?
Hint: it's a lot bigger than 1.2 * 10^-7.
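This is easy to check experimentally. Below is a minimal C sketch (an
illustration added here, not part of the original notes) that uses the
standard nextafterf() function to print the absolute spacing of
single-precision numbers near 1.0 and near 10^32; the gap it reports
near 10^32 is what the hint is pointing at.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x1 = 1.0f;
        float x2 = 1.0e32f;

        /* nextafterf(x, HUGE_VALF) is the next single-precision number above x */
        printf("spacing near 1.0   : %g\n", nextafterf(x1, HUGE_VALF) - x1);
        printf("spacing near 1.0e32: %g\n", nextafterf(x2, HUGE_VALF) - x2);
        return 0;
    }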
Extended Precision
The IEEE 754 standard also specifies extended versions of single and
double precision, although the range of exponents for each simply has
a lower bound rather than an exact specification.
For extended double precision, the mantissa needs at least 64 bits, and
the biased exponent needs enough bits to represent
at least the range from 2^-16382 to 2^16384.
This means at least 79 bits for the full format.
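Whether your compiler actually provides an extended format is easy to
check. The small C sketch below (my assumption: that "long double" maps
to the extended format, as it does on many x86 systems) prints the
relevant constants from float.h.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* These constants describe whatever "long double" is on this system. */
        printf("mantissa bits : %d\n", LDBL_MANT_DIG);   /* 64 for x86 extended     */
        printf("min exponent  : %d\n", LDBL_MIN_EXP);    /* -16381 for x86 extended */
        printf("max exponent  : %d\n", LDBL_MAX_EXP);    /* 16384 for x86 extended  */
        printf("storage bytes : %zu\n", sizeof(long double));
        return 0;
    }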
IEEE Floating Point Rounding
In addition to the representation, the IEEE FP standard includes
operations, one of which is rounding.
The default is typically "round to nearest", with
round to even in case of a tie. Other modes are
- round to +inf
- round to -inf
- round to zero
There is a special name for the last mode ... what is it?
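C99's fenv.h header is one of the few portable, high-level hooks into the
rounding modes. Here is a brief sketch (added for illustration; strictly
speaking standard C also wants "#pragma STDC FENV_ACCESS ON" for this,
though many compilers behave well without it) that applies all four modes
to 1/3.

    #include <stdio.h>
    #include <fenv.h>

    int main(void)
    {
        volatile double x = 1.0, y = 3.0;

        fesetround(FE_TONEAREST);    /* the default: round to nearest, ties to even */
        printf("to nearest: %.17e\n", x / y);

        fesetround(FE_UPWARD);       /* round to +inf */
        printf("to +inf   : %.17e\n", x / y);

        fesetround(FE_DOWNWARD);     /* round to -inf */
        printf("to -inf   : %.17e\n", x / y);

        fesetround(FE_TOWARDZERO);   /* round to zero */
        printf("to zero   : %.17e\n", x / y);

        fesetround(FE_TONEAREST);    /* restore the default */
        return 0;
    }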
Other operations are also specified, but won't be too important
for the level our work occurs at. However, if you are working
in a field that uses special functions frequently (Bessel functions,
exponentials, trig functions) and/or you need to implement those,
a careful study of the IEEE FP operations can be useful.
Finding Floating Point Properties
The IEEE standard has removed much of the fun from finding the
properties of your machine's floating point representation. Nevertheless
it is useful to have a package which can find things like machine
precision for you - this helps you create portable code.
We will examine this in Matlab, but the
LAPACK project has produced a function called
dlamch
which returns information about the floating point system of the machine
you are running on. The values include:
eps   = relative machine precision
sfmin = smallest number for which 1/sfmin does not overflow
base  = base of the machine
prec  = eps*base
t     = number of (base) digits in the mantissa
emin  = minimum exponent before gradual underflow
rmin  = underflow threshold: base^(emin-1)
emax  = maximum exponent before overflow
rmax  = overflow threshold: base^emax * (1-eps)
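If you are working in C rather than Fortran, a rough analogue of these
values (an analogy drawn here, not part of dlamch itself) is available from
the constants in float.h; note that LAPACK's conventions differ slightly,
e.g. its eps may be half of DBL_EPSILON.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        printf("eps   (relative precision)      : %e\n", DBL_EPSILON);
        printf("base  (radix)                   : %d\n", FLT_RADIX);
        printf("t     (base digits in mantissa) : %d\n", DBL_MANT_DIG);
        printf("emin  (minimum exponent)        : %d\n", DBL_MIN_EXP);
        printf("rmin  (underflow threshold)     : %e\n", DBL_MIN);
        printf("emax  (maximum exponent)        : %d\n", DBL_MAX_EXP);
        printf("rmax  (overflow threshold)      : %e\n", DBL_MAX);
        return 0;
    }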
For a real workout of your machine's number system, Kahan and others
have developed the program called PARANOIA. It tests every
aspect of your floating point hardware, and was a good prod for
the development of the IEEE standard.
Digits of Accuracy
The double precision FP format is required to provide
15-17 significant digits of accuracy. What does the range mean?
- If a decimal string with 15 or fewer significant digits
is converted to double and converted back to a string, the
final string should match the initial one.
- If a double precision number is converted to a decimal string
with 17 significant digits
and then converted back to a double precision number, the final
number must match the original.
This is one requirement that many vendors fail to meet: it requires
the I/O libraries and conversion routines to perform to a standard
that, unfortunately, few of them hold to.
A critical point comes out of this: if you are writing out a double
precision number and want to retain the full accuracy, in C you need to
use the print format descriptor %21.17e.
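The round trip is easy to test against your own I/O library. The sketch
below (my own illustration, not from the notes) prints a double with
%21.17e and reads it back with strtod(); on a conforming system the two
values compare equal.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        double x = 4.0 * atan(1.0);    /* pi, as an arbitrary test value */
        char buf[64];

        sprintf(buf, "%21.17e", x);    /* write with the format recommended above */
        double y = strtod(buf, NULL);  /* read the decimal string back in */

        printf("string    : %s\n", buf);
        printf("round trip: %s\n", (x == y) ? "exact" : "NOT exact");
        return 0;
    }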
Languages and IEEE Floating Point
It has long been a sore point with numerical analysts that the IEEE
standard has typically been implemented in hardware, but there is little or
no high-level language access to, e.g., setting rounding modes from
within a code. Some particular features:
- Java's
primitive double type gives you no named constants or tests for +/-inf and
NaN; for those you must use the wrapper class Double (e.g., Double.NaN,
Double.isNaN()). That wrapper was created so that doubles could
be put into hash tables. Note the difference in equality semantics:
a NaN double is not equal to itself (== follows the IEEE rules), but a NaN
Double object is equal to itself (its equals() method treats NaN as equal
to NaN so that hashing works).
- Not surprisingly, Fortran is the best at numerical computing - that is
what it was built for. The language itself provides inquiry functions for returning
precision, range, largest (huge()) and smallest (tiny()) FP information. You
can also specify a variable to have a given precision and range rather than a
specific IEEE type. E.g.,
real(kind=selected_real_kind(10,50)) :: d
defines the number d with at least 10 decimal digits of
precision and exponents that can handle the
range from 10^-50 to 10^50. Fortran also provides
inquiry functions for a given value, e.g., spacing(d) gives the absolute
spacing of numbers near d.
If you want to see
some results from Fortran's inquiry functions, see
the file SystemInformation_charis
If you want to see
how to use and invoke those utilities, grab the
tarball sysfacts.tar
or the
zipfile sysfacts.zip.
- C/C++ have header files on most systems (float.h, math.h) that define
quantities like the largest/smallest floats or doubles, macros to test if a
number is infinity or NaN, etc. They also provide nifty constants, like
pi = 3.1415926535897932384626433832795029.
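For instance, here is a tiny C sketch (assuming a POSIX-style math.h that
defines M_PI) exercising a few of those macros and constants, including the
NaN-is-never-equal-to-itself behavior mentioned in the Java item above.

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void)
    {
        volatile double zero = 0.0;
        double inf  = 1.0 / zero;   /* IEEE arithmetic: dividing by zero gives +inf */
        double qnan = inf - inf;    /* inf - inf produces a NaN */

        printf("largest double: %e\n", DBL_MAX);
        printf("isinf(inf)    : %d\n", isinf(inf));
        printf("isnan(qnan)   : %d\n", isnan(qnan));
        printf("qnan == qnan  : %d\n", qnan == qnan);  /* 0: a NaN never equals itself */
        printf("M_PI          : %.17g\n", M_PI);
        return 0;
    }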
Potpourri
- If you use the kind of tests in
FpFacts.m for finding
floating point extrema such as smallest exponent, largest exponent,
etc., then you may trigger a "floating point exception".
Depending on the system you are using, that may cause your
program to fail. On SGIs there is an environment variable
TRAP_FPE you can set to specify how to react to floating point
exceptions. You can also write your own "exception handler"
on any IEEE-compliant machine.
- Also on the SGIs, you can use the function fp_class() to determine
what flavor of floating point number the argument is. Look at
the man pages for it if you plan on using it. However, that is
SGI specific.
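If you are not on an SGI, C99 offers portable versions of both facilities:
fetestexcept() from fenv.h to inspect the exception flags, and fpclassify()
to find out what flavor of floating point number you have. A hedged sketch,
assuming a C99 compiler:

    #include <stdio.h>
    #include <math.h>
    #include <fenv.h>

    /* Classify a value the portable C99 way (instead of SGI's fp_class()). */
    static const char *kind(double x)
    {
        switch (fpclassify(x)) {
        case FP_NAN:       return "NaN";
        case FP_INFINITE:  return "infinity";
        case FP_ZERO:      return "zero";
        case FP_SUBNORMAL: return "subnormal";
        default:           return "normal";
        }
    }

    int main(void)
    {
        volatile double tiny = 1.0e-308, big = 1.0e308;

        feclearexcept(FE_ALL_EXCEPT);      /* start with all flags cleared */
        volatile double u = tiny * tiny;   /* underflows toward zero       */
        volatile double o = big * big;     /* overflows to +inf            */

        printf("underflow flag: %d, result is %s\n",
               fetestexcept(FE_UNDERFLOW) != 0, kind(u));
        printf("overflow flag : %d, result is %s\n",
               fetestexcept(FE_OVERFLOW)  != 0, kind(o));
        return 0;
    }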