Other IEEE Floating Point Notes
Suppose x = m * 2^p is
a machine representable number in single precision. Then
- The next larger machine number is
x+ = (m + 2^-23) * 2^p.
- The next smaller machine number is
x- = (m - 2^-23) * 2^p.
- These have spacing of
x+ - x = x - x- = 2^-23 * 2^p = 2^(p-23).
- and a relative spacing of about
(x+ - x)/x = 2^-23 ~ 1.2 * 10^-7
This number is the precision of the representation, and is
referred to as "machine epsilon". But it behaves the way users
typically expect only for values near 1.0.
For example,
let x = flt(10^32) be the closest machine representable number
to 10^32.
What is the smallest machine number epsilon such that
flt(10^32) + epsilon is not equal to flt(10^32)?
Hint: it's a lot bigger than 1.2 * 10^-7.
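This is easy to check experimentally. Below is a minimal C sketch (an
illustration added here, not part of the original notes) that uses the
standard nextafterf() function to print the absolute spacing of
single-precision numbers near 1.0 and near 10^32; the gap it reports
near 10^32 is what the hint is pointing at.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        float x1 = 1.0f;
        float x2 = 1.0e32f;

        /* nextafterf(x, HUGE_VALF) is the next single-precision number above x */
        printf("spacing near 1.0   : %g\n", nextafterf(x1, HUGE_VALF) - x1);
        printf("spacing near 1.0e32: %g\n", nextafterf(x2, HUGE_VALF) - x2);
        return 0;
    }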
Extended Precision
The IEEE 754 standard also specifies extended versions of single and
double precision, although the range of exponents for each simply has
a lower bound rather than an exact specification.
For extended double precision, the mantissa needs at least 64 bits, and
the biased exponent needs enough bits to represent
at least the range from 2^-16382 to 2^16384.
This means at least 79 bits for the full format.
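Whether your compiler actually provides an extended format is easy to
check. The small C sketch below (my assumption: that "long double" maps
to the extended format, as it does on many x86 systems) prints the
relevant constants from float.h.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* These constants describe whatever "long double" is on this system. */
        printf("mantissa bits : %d\n", LDBL_MANT_DIG);   /* 64 for x86 extended     */
        printf("min exponent  : %d\n", LDBL_MIN_EXP);    /* -16381 for x86 extended */
        printf("max exponent  : %d\n", LDBL_MAX_EXP);    /* 16384 for x86 extended  */
        printf("storage bytes : %zu\n", sizeof(long double));
        return 0;
    }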
IEEE Floating Point Rounding
In addition to the representation, the IEEE FP standard includes
operations, one of which is rounding.
The default is typically "round to nearest", with
round to even in case of a tie. Other modes are
- round to +inf
- round to -inf
- round to zero
There is a special name for the last mode ... what is it?
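C99's fenv.h header is one of the few portable, high-level hooks into the
rounding modes. Here is a brief sketch (added for illustration; strictly
speaking standard C also wants "#pragma STDC FENV_ACCESS ON" for this,
though many compilers behave well without it) that applies all four modes
to 1/3.

    #include <stdio.h>
    #include <fenv.h>

    int main(void)
    {
        volatile double x = 1.0, y = 3.0;

        fesetround(FE_TONEAREST);    /* the default: round to nearest, ties to even */
        printf("to nearest: %.17e\n", x / y);

        fesetround(FE_UPWARD);       /* round to +inf */
        printf("to +inf   : %.17e\n", x / y);

        fesetround(FE_DOWNWARD);     /* round to -inf */
        printf("to -inf   : %.17e\n", x / y);

        fesetround(FE_TOWARDZERO);   /* round to zero */
        printf("to zero   : %.17e\n", x / y);

        fesetround(FE_TONEAREST);    /* restore the default */
        return 0;
    }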
Other operations are also specified, but won't be too important
for the level our work occurs at. However, if you are working
in a field that uses special functions frequently (Bessel functions,
exponentials, trig functions) and/or you need to implement those,
a careful study of the IEEE FP operations can be useful.
Finding Floating Point Properties
The IEEE standard has removed much of the fun from finding the
properties of your machine's floating point representation. Nevertheless
it is useful to have a package which can find things like machine
precision for you - this helps you create portable code.
We will examine this in Matlab, but the
LAPACK project has produced a function called
dlamch
which returns information about the floating point system of the machine
you are running on. The values include:
eps   = relative machine precision
sfmin = smallest number for which 1/sfmin does not overflow
base  = base of the machine
prec  = eps*base
t     = number of (base) digits in the mantissa
emin  = minimum exponent before gradual underflow
rmin  = underflow threshold: base^(emin-1)
emax  = maximum exponent before overflow
rmax  = overflow threshold: base^emax * (1-eps)
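If you are working in C rather than Fortran, a rough analogue of these
values (an analogy drawn here, not part of dlamch itself) is available from
the constants in float.h; note that LAPACK's conventions differ slightly,
e.g. its eps may be half of DBL_EPSILON.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        printf("eps   (relative precision)      : %e\n", DBL_EPSILON);
        printf("base  (radix)                   : %d\n", FLT_RADIX);
        printf("t     (base digits in mantissa) : %d\n", DBL_MANT_DIG);
        printf("emin  (minimum exponent)        : %d\n", DBL_MIN_EXP);
        printf("rmin  (underflow threshold)     : %e\n", DBL_MIN);
        printf("emax  (maximum exponent)        : %d\n", DBL_MAX_EXP);
        printf("rmax  (overflow threshold)      : %e\n", DBL_MAX);
        return 0;
    }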
For a real workout of your machine's number system, Kahan and others
have developed the program called PARANOIA. It tests every
aspect of your floating point hardware, and was a good prod for
the development of the IEEE standard.
Digits of Accuracy
The double precision FP format is required to provide
15-17 significant digits of accuracy. What does the range mean?
- If a decimal string with 15 or fewer significant digits
is converted to double and converted back to a string, the
final string should match the initial one.
- If a double precision number is converted to a decimal string
with 17 significant digits
and then converted back to a double precision number, the final
number must match the original.
This is one requirement that many vendors fail to meet: it requires
the I/O libraries and conversion routines to perform to a standard
that, unfortunately, few of them hold to.
A critical point comes out of this: if you are writing out a double
precision number and want to retain the full accuracy, in C you need to
use the print format descriptor %21.17e.
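The round trip is easy to test against your own I/O library. The sketch
below (my own illustration, not from the notes) prints a double with
%21.17e and reads it back with strtod(); on a conforming system the two
values compare equal.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        double x = 4.0 * atan(1.0);    /* pi, as an arbitrary test value */
        char buf[64];

        sprintf(buf, "%21.17e", x);    /* write with the format recommended above */
        double y = strtod(buf, NULL);  /* read the decimal string back in */

        printf("string    : %s\n", buf);
        printf("round trip: %s\n", (x == y) ? "exact" : "NOT exact");
        return 0;
    }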
Languages and IEEE Floating Point
It has long been a sore point with numerical analysts that the IEEE
standard has typically been implemented in hardware, but there is little or
no high-level language access to, e.g., setting rounding modes from
within a code. Some particular features:
- Java's
primitive double type gives you no named constants or tests for +/-inf and
NaN; for those you must use the wrapper class Double (e.g., Double.NaN,
Double.isNaN()). That wrapper was created so that doubles could
be put into hash tables. Note the difference in equality semantics:
a NaN double is not equal to itself (== follows the IEEE rules), but a NaN
Double object is equal to itself (its equals() method treats NaN as equal
to NaN so that hashing works).
- Not surprisingly, Fortran is the best at numerical computing - that is
what it was built for. The language itself provides inquiry functions for returning
precision, range, largest (huge()) and smallest (tiny()) FP information. You
can also specify a variable to have a given precision and range rather than a
specific IEEE type. E.g.,
real(kind=selected_real_kind(10,50)) :: d
defines the number d with at least 10 decimal digits of
precision and exponents that can handle the
range from 10^-50 to 10^50. Fortran also provides
inquiry functions for a given value, e.g., spacing(d) gives the absolute
spacing of numbers near d.
If you want to see
some results from Fortran's inquiry functions, see
the file SystemInformation_charis
If you want to see
how to use and invoke those utilities, grab the
tarball sysfacts.tar
or the
zipfile sysfacts.zip.
- C/C++ have header files on most systems (float.h, math.h) that define
quantities like the largest/smallest floats or doubles, macros to test if a
number is infinity or NaN, etc. They also provide nifty constants, like
pi = 3.1415926535897932384626433832795029.
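For instance, here is a tiny C sketch (assuming a POSIX-style math.h that
defines M_PI) exercising a few of those macros and constants, including the
NaN-is-never-equal-to-itself behavior mentioned in the Java item above.

    #include <stdio.h>
    #include <math.h>
    #include <float.h>

    int main(void)
    {
        volatile double zero = 0.0;
        double inf  = 1.0 / zero;   /* IEEE arithmetic: dividing by zero gives +inf */
        double qnan = inf - inf;    /* inf - inf produces a NaN */

        printf("largest double: %e\n", DBL_MAX);
        printf("isinf(inf)    : %d\n", isinf(inf));
        printf("isnan(qnan)   : %d\n", isnan(qnan));
        printf("qnan == qnan  : %d\n", qnan == qnan);  /* 0: a NaN never equals itself */
        printf("M_PI          : %.17g\n", M_PI);
        return 0;
    }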
Potpourri
- If you use the kind of tests in
FpFacts.m for finding
floating point extrema such as smallest exponent, largest exponent,
etc., then you may trigger a "floating point exception".
Depending on the system you are using, that may cause your
program to fail. On SGIs there is an environment variable
TRAP_FPE you can set to specify how to react to floating point
exceptions. You can also write your own "exception handler"
on any IEEE-compliant machine.
- Also on the SGIs, you can use the function fp_class() to determine
what flavor of floating point number the argument is. Look at
the man pages for it if you plan on using it. However, that is
SGI specific.
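If you are not on an SGI, C99 offers portable versions of both facilities:
fetestexcept() from fenv.h to inspect the exception flags, and fpclassify()
to find out what flavor of floating point number you have. A hedged sketch,
assuming a C99 compiler:

    #include <stdio.h>
    #include <math.h>
    #include <fenv.h>

    /* Classify a value the portable C99 way (instead of SGI's fp_class()). */
    static const char *kind(double x)
    {
        switch (fpclassify(x)) {
        case FP_NAN:       return "NaN";
        case FP_INFINITE:  return "infinity";
        case FP_ZERO:      return "zero";
        case FP_SUBNORMAL: return "subnormal";
        default:           return "normal";
        }
    }

    int main(void)
    {
        volatile double tiny = 1.0e-308, big = 1.0e308;

        feclearexcept(FE_ALL_EXCEPT);      /* start with all flags cleared */
        volatile double u = tiny * tiny;   /* underflows toward zero       */
        volatile double o = big * big;     /* overflows to +inf            */

        printf("underflow flag: %d, result is %s\n",
               fetestexcept(FE_UNDERFLOW) != 0, kind(u));
        printf("overflow flag : %d, result is %s\n",
               fetestexcept(FE_OVERFLOW)  != 0, kind(o));
        return 0;
    }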