
Floating point



To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.

  e=3;  s=4.734612
× e=5;  s=5.417242
-----------------------
  e=8;  s=25.648538980104 (true product)
  e=8;  s=25.64854        (after rounding)
  e=9;  s=2.564854        (after normalization)
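
The steps above can be sketched in C with integer arithmetic on 7-digit decimal significands. The `Dec7` type and the `dec7_mul` function are hypothetical names introduced only for this illustration. The significand is assumed to be stored as an integer scaled by 10^6 (so 4.734612 is stored as 4734612), and rounding and normalization are folded into a single step rather than shown as two separate lines.

```c
#include <stdint.h>

/* Hypothetical 7-digit decimal float: value = (sig / 10^6) * 10^exp,
   with 1000000 <= sig <= 9999999 for a normalized number. */
typedef struct {
    int64_t sig;  /* significand scaled by 10^6 */
    int exp;      /* decimal exponent */
} Dec7;

/* Multiply two normalized, positive Dec7 values: multiply the
   significands, add the exponents, then round (half up) and normalize. */
Dec7 dec7_mul(Dec7 a, Dec7 b)
{
    Dec7 r;
    int64_t p = a.sig * b.sig;            /* exact product, scaled by 10^12 */
    r.exp = a.exp + b.exp;

    if (p >= INT64_C(10000000000000)) {   /* product is 10.0 or more */
        r.sig = (p + 5000000) / 10000000; /* keep 7 digits, shift one place */
        r.exp += 1;
    } else {
        r.sig = (p + 500000) / 1000000;   /* keep 7 digits */
    }
    if (r.sig == 10000000) {              /* rounding carried into an 8th digit */
        r.sig = 1000000;
        r.exp += 1;
    }
    return r;
}
```

Calling `dec7_mul` with the operands of the worked example (4734612 with exponent 3, and 5417242 with exponent 5) gives the significand 2564854 with exponent 9, matching the last line of the example.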

Division is done similarly, but is more complicated.

There are no cancellation or absorption problems with multiplication or division, though small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex (see Booth's multiplication algorithm and digital division).[4] For a fast, simple method, see the Horner method.


Dealing with exceptional cases

Floating-point computation in a computer can run into three kinds of problems:

  • An operation can be mathematically illegal, such as division by zero.
  • An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numbers).
  • An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large) or underflow (exponent too small).

Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. Modern IEEE-compliant systems have a uniform way of handling these situations. An important part of the mechanism involves error values that result from a failing computation, and that can propagate silently through subsequent computation until they are detected at a point of the programmer's choosing.

The two error values are "infinity" (often denoted "INF"), and "NaN" ("not a number"), which covers all other errors. "Infinity" does not necessarily mean that the result is actually infinite. It simply means "too large to represent".

Both of these are encoded with the exponent field set to all 1's. (Recall that exponent fields of all 0's or all 1's are reserved for special meanings.) The significand field is set to something that can distinguish them—typically zero for INF and nonzero for NaN. The sign bit is meaningful for INF, that is, floating-point hardware distinguishes between +∞ and −∞.

When a nonzero number is divided by zero (the divisor must be exactly zero), a "zerodivide" event occurs, and the result is set to infinity of the appropriate sign. In other cases in which the result's exponent is too large to represent, such as division of an extremely large number by an extremely small number, an "overflow" event occurs, also producing infinity of the appropriate sign. This is different from a zerodivide, though both produce a result of infinity, and the distinction is usually unimportant in practice.

Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as

  • (+INF) + (+7) = (+INF)
  • (+INF) × (−2) = (−INF)
  • But: (+INF) × 0 = NaN—there is no meaningful thing to do
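
The three rules above can be checked with a short C sketch. It assumes a platform whose `double` follows IEEE 754 and uses the standard `INFINITY`, `isinf` and `isnan` macros from `<math.h>`; the function name `infinity_rules_hold` is made up for this example.

```c
#include <math.h>   /* INFINITY, isinf, isnan */

/* Returns 1 if the three rules listed above hold on this platform. */
int infinity_rules_hold(void)
{
    volatile double inf = INFINITY;  /* volatile: evaluate at run time */
    double sum  = inf + 7.0;         /* expected: +INF */
    double prod = inf * -2.0;        /* expected: -INF */
    double bad  = inf * 0.0;         /* expected: NaN  */

    return isinf(sum)  && sum  > 0.0
        && isinf(prod) && prod < 0.0
        && isnan(bad);
}
```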

When the result of an operation has an exponent too small to represent properly, an "underflow" event occurs. The hardware responds to this by changing to a format in which the significand is not normalized, and there is no "hidden" bit—that is, all bits of the significand are represented. The exponent field is set to the reserved value of zero. The significand is set to whatever it has to be in order to be consistent with the exponent. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal". Denorms are perfectly legal operands to arithmetic operations.

If no significant bits are able to appear in the significand field, the number is zero. Note that, in this case, the exponent field and significand field are all zeros—floating-point zero is represented by all zeros.
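
As a rough illustration of the reserved encodings described above, the following C sketch classifies a single-precision bit pattern by inspecting its exponent and fraction fields. It assumes that `float` is the 32-bit IEEE format (1 sign bit, 8 exponent bits, 23 fraction bits); the names `FloatKind`, `classify_bits` and `float_bits` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef enum {
    KIND_ZERO, KIND_SUBNORMAL, KIND_NORMAL, KIND_INFINITY, KIND_NAN
} FloatKind;

/* Classify a 32-bit pattern by its exponent and fraction fields. */
FloatKind classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8-bit exponent field */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23-bit fraction field */

    if (exponent == 0xFFu)                     /* all 1's: INF or NaN */
        return fraction == 0 ? KIND_INFINITY : KIND_NAN;
    if (exponent == 0)                         /* all 0's: zero or denorm */
        return fraction == 0 ? KIND_ZERO : KIND_SUBNORMAL;
    return KIND_NORMAL;
}

/* Copy a float's storage into an integer (memcpy avoids aliasing issues). */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

The sign bit is ignored by `classify_bits`, so both +∞ and −∞ are reported as `KIND_INFINITY`, and a pattern with only the sign bit set is reported as `KIND_ZERO`.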

The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.

Other errors, such as division of zero by zero, or taking the square root of −1, cause an "operand error" event, and produce a NaN result. NaNs propagate aggressively through arithmetic operations—any NaN operand to any operation causes an operand error and produces a NaN result.

In summary, there are five special "events" that may occur, though some of them are quite benign:

  • An overflow occurs as described previously, producing an infinity.
  • An underflow occurs as described previously, producing a denorm or zero.
  • A zerodivide occurs as described previously, producing an infinity of the appropriate sign.
  • An "operand error" occurs as described previously, producing a NaN.
  • An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.

Computer hardware is typically able to raise exceptions when these events occur. How this is done is system-dependent. Usually these exceptions are all masked (disabled), relying only on the propagation of error values. Sometimes overflow, zerodivide, and operand error are enabled.


Accuracy problems

The fact that floating-point numbers cannot faithfully mimic the real numbers, and that floating-point operations cannot faithfully mimic true arithmetic operations, leads to many surprising situations.

For example, the non-representability of 0.1 and 0.01 means that the result of attempting to square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision) representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is

0.100000001490116119384765625 exactly.

Squaring this number gives

0.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

0.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

0.009999999776482582092285156250 exactly.
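
The comparison above can be reproduced with a small C sketch, assuming that `float` is IEEE single precision and that single-precision products are rounded to `float`. The function names are illustrative only.

```c
/* Square the float nearest to 0.1 in single precision. The volatile
   qualifiers force each value to be stored as a float at run time. */
float square_of_tenth(void)
{
    volatile float tenth = 0.1f;
    volatile float square = tenth * tenth;
    return square;
}

/* The float nearest to 0.01, for comparison. */
float nearest_hundredth(void)
{
    return 0.01f;
}
```

Under those assumptions the two functions return different values, with `square_of_tenth()` being the larger one, as the figures above indicate.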

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

  #include <math.h>   // declares tan()

  // Enough digits to be sure we get the correct approximation.
  double pi = 3.1415926535897932384626433832795;
  double z = tan(pi/2.0);   // finite: pi/2.0 lies slightly below the true π/2

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225×10^−15 in double precision, or −0.8742×10^−7 in single precision.[5]

While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit decimal arithmetic:

 1234.567 + 45.67844 = 1280.245
                       1280.245 + 0.0004 = 1280.245
 but 
 45.67844 + 0.0004 = 45.67884
                     45.67884 + 1234.567 = 1280.246
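
The same effect appears in binary arithmetic. The following C sketch assumes IEEE double precision with no extra intermediate precision; the function names are made up for the illustration, and the operands 0.1, 0.2 and 0.3 are an example chosen here rather than taken from the decimal case above.

```c
/* Sum three doubles with the two possible groupings. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

With the arguments 0.1, 0.2 and 0.3, `sum_left` yields 0.6000000000000001 while `sum_right` yields 0.6, so the two groupings disagree in the last bit.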

They are also not necessarily distributive. That is, (a + b) ×c may not be the same as a×c + b×c:

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but 
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

  • Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
  • Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round.
  • Limited exponent range: results might overflow yielding infinity, or underflow yielding a denormal value or zero. If a denormal number results, precision will be lost.
  • Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
  • Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values. Programmers often perform comparisons within some tolerance (often a decimal constant, itself not accurately represented), but that doesn't necessarily make the problem go away.


Minimizing the effect of accuracy problems

Because of the issues noted above, naive use of floating-point arithmetic can lead to many problems. The creation of thoroughly robust floating-point software is a complicated undertaking, and a good understanding of numerical analysis is essential.

In addition to careful design of programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.

Floating-point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.[6] The "decimal" data type of the C# programming language, and the IEEE 854 standard, are designed to avoid the problems of binary floating-point representation, and make the arithmetic always behave as expected when numbers are printed in decimal.
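
A small C sketch can illustrate why decimal amounts are awkward in binary floating point. Note that it does not use a decimal type such as the ones mentioned above; it swaps in a simpler technique, counting in integer cents, purely for comparison. The function names are hypothetical.

```c
/* Add `count` copies of 0.10 using binary floating point. */
double total_as_double(int count)
{
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += 0.10;      /* 0.10 is not exactly representable */
    return total;
}

/* Add `count` copies of 10 cents using exact integer arithmetic. */
long total_in_cents(int count)
{
    long cents = 0;
    for (int i = 0; i < count; i++)
        cents += 10;
    return cents;
}
```

With IEEE doubles, `total_as_double(10)` is not equal to 1.0, whereas `total_in_cents(10)` is exactly 100.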

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.

Expectations from mathematics may not be realised in the field of floating-point computation. For example, it is known that (x + y)(x − y) = x² − y², and that sin²θ + cos²θ = 1. These facts cannot be counted on when the quantities involved are the result of floating-point computation.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article, and the reader is referred to the references at the bottom of this article. Descriptions of a few simple techniques follow.

The use of the equality test (if (x==y) ...) is usually not recommended when expectations are based on results from pure mathematics. Such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ...), where epsilon is sufficiently small and tailored to the application (for example, 1.0E-13). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
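
One possible shape for such a fuzzy comparison is sketched below in C. It is only an illustration: the function name `nearly_equal` and the choice of combining a relative and an absolute tolerance are assumptions made here, and suitable tolerances depend entirely on the application.

```c
#include <math.h>

/* Returns 1 if a and b are equal, or differ by at most abs_tol, or by at
   most rel_tol times the larger of their magnitudes. Returns 0 if either
   argument is NaN. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    if (a == b)                      /* also handles equal infinities */
        return 1;

    double diff  = fabs(a - b);
    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

With IEEE doubles, `0.1 + 0.2 == 0.3` is false, while `nearly_equal(0.1 + 0.2, 0.3, 1e-12, 0.0)` is true. A purely absolute epsilon, as in the test quoted above, behaves poorly when the operands are very large or very small, which is one reason the relative term is included in this sketch.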

An awareness of when loss of significance can occur is useful. For example, when a very large number of values is added, each individual addend is very small compared with the running sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like

  3253.671
+    3.141276
-------------
  3256.812

The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.
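
A minimal C sketch of Kahan (compensated) summation next to a naive loop is given below. It assumes single-precision IEEE arithmetic and a compiler that does not reassociate floating-point expressions (for example, no "fast math" option). The helper `sum_of_tenths` is a hypothetical demonstration function, not part of the algorithm.

```c
#include <stddef.h>

/* Plain left-to-right summation. */
float naive_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Kahan summation: carry the rounding error of each addition forward. */
float kahan_sum(const float *x, size_t n)
{
    float sum = 0.0f;
    float comp = 0.0f;            /* running compensation */
    for (size_t i = 0; i < n; i++) {
        float y = x[i] - comp;    /* apply the correction from the last step */
        float t = sum + y;        /* low-order digits of y are lost here */
        comp = (t - sum) - y;     /* recover what was lost, with opposite sign */
        sum = t;
    }
    return sum;
}

/* Hypothetical demo: add one million copies of 0.1f with either method. */
float sum_of_tenths(int use_kahan)
{
    enum { N = 1000000 };
    static float data[N];
    for (size_t i = 0; i < N; i++)
        data[i] = 0.1f;
    return use_kahan ? kahan_sum(data, N) : naive_sum(data, N);
}
```

Under these assumptions the naive total drifts away from 100000 by several hundred, while the compensated total stays within a small fraction of a unit of it.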

Computations may be rearranged in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. The recurrence formula for the circumscribed polygon is:

t_0 = \frac{1}{\sqrt{3}}
t_{i+1} = \frac{\sqrt{t_i^2+1}-1}{t_i} \qquad \text{(first form)}
t_{i+1} = \frac{t_i}{\sqrt{t_i^2+1}+1} \qquad \text{(second form)}
\pi \sim 6 \times 2^i \times t_i, \qquad \text{converging as } i \rightarrow \infty

Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:

 i   6 × 2^i × t_i, first form  6 × 2^i × t_i, second form

 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.141592653589793238462643383...

While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.


Notes and references

  1. ^ Revising ANSI/IEEE Std 754-1985 http://754r.ucbtest.org
  2. ^ Haohuan Fu, Oskar Mencer, Wayne Luk (Dec. 2006). "Comparing Floating-point and Logarithmic Number Representations for Reconfigurable Acceleration". IEEE Conference on Field Programmable Technology. 
  3. ^ Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.
  4. ^ The enormous complexity of modern division algorithms once led to a famous error. An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See Pentium FDIV bug.
  5. ^ But an attempted computation of cos(π) yields −1 exactly. Since the derivative is nearly zero near π, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating-point numbers around −1, and the rounded result is exact.
  6. ^ General Decimal Arithmetic

BCUZ.com FACTS Encyclopedia content is licensed under the GFDL as approved by Wikipedia.