Information About IEEE 754
IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes this is optional. For example, the C programming language, which predates IEEE 754, now allows but does not require IEEE arithmetic (the C float type is typically IEEE single-precision and double is typically IEEE double-precision).

The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as '''IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems''' (originally the reference number was IEC 559:1989). A later standard, IEEE 854-1987, covers "radix independent floating point", as long as the radix is 2 or 10.

ANATOMY OF A FLOATING-POINT NUMBER

Following is a description of the standard's format for floating-point numbers.

Bit conventions used in this article

Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. The lowest indexed bit is usually the lsb (least significant bit, the one that, if changed, would cause the smallest variation of the represented value).

General layout

Binary floating-point numbers are stored in a sign-magnitude form. From the most significant bit to the least significant bit, the fields are the sign bit, ''exponent'' and ''fraction'': the most significant bit is the sign bit, ''exponent'' is the biased exponent, and ''fraction'' is the significand minus its ''most significant bit''.

Exponent biasing

The exponent is biased by $2^{e-1} - 1$, where ''e'' is the width of the exponent field in bits. See also Excess-''N''.
Biasing is done because exponents have to be signed values in order to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored: its value is shifted into an unsigned range suitable for comparison.

For example, to represent a number which has an exponent of 17, ''exponent'' is $17 + 2^{e-1} - 1$. Assuming ''e'' = 8, the stored ''exponent'' is 17 + 128 − 1 = 144.

Cases

The most significant bit of the significand (not stored) is determined by the value of ''exponent''. If $0 < \mathit{exponent} < 2^{e} - 1$, the most significant bit of the ''significand'' is 1, and the number is said to be ''normalized''. If ''exponent'' is 0, the most significant bit of the ''significand'' is 0 and the number is said to be ''de-normalized''. Three special cases arise:
# if ''exponent'' is 0 and ''fraction'' is 0, the number is ±0 (depending on the sign bit),
# if ''exponent'' = $2^{e} - 1$ and ''fraction'' is 0, the number is ±infinity (again depending on the sign bit), and
# if ''exponent'' = $2^{e} - 1$ and ''fraction'' is not 0, the number being represented is not a number (NaN).

This can be summarized as:

| Type | ''exponent'' | ''fraction'' |
| Zeroes | 0 | 0 |
| Denormalized numbers | 0 | non-zero |
| Normalized numbers | 1 to $2^{e} - 2$ | any |
| Infinities | $2^{e} - 1$ | 0 |
| NaNs | $2^{e} - 1$ | non-zero |

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits: 1 sign bit, 8 exponent bits and 23 fraction bits. The exponent is biased by $2^{8-1} - 1 = 127$ in this case, so exponents in the range −126 to +127 are representable (see the explanation above for why biasing is done). An exponent of −127 would be biased to the value 0, but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255, but this is reserved to encode an infinity or not a number (NaN). See the summary above.

For normalized numbers, the most common case, ''exponent'' is the biased exponent and ''fraction'' is the significand minus the most significant bit.
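As a rough illustration of the biasing rule and the exponent/fraction cases described above, the following Python sketch computes a stored exponent and classifies a pair of field values. The function names `biased` and `classify` and the parameter `e` (the exponent field width in bits) are chosen here for illustration only and are not taken from the standard.

```python
def biased(true_exponent: int, e: int = 8) -> int:
    """Return the stored (biased) exponent: true exponent + 2^(e-1) - 1."""
    return true_exponent + (1 << (e - 1)) - 1


def classify(exponent: int, fraction: int, e: int = 8) -> str:
    """Classify a value from its stored exponent and fraction fields.

    `e` is the exponent field width in bits (8 for single precision).
    """
    all_ones = (1 << e) - 1  # 2^e - 1, the reserved maximum exponent
    if exponent == 0:
        # Hidden significand bit is 0: either zero or a denormalized number.
        return "zero" if fraction == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    # 0 < exponent < 2^e - 1: hidden significand bit is 1.
    return "normalized"


print(biased(17))               # 144, the example from the text
print(classify(biased(17), 0))  # normalized
print(classify(255, 0))         # infinity
```

The sign bit is deliberately left out, since it does not affect which case applies.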
The number has value v:

$v = s \times 2^{e} \times m$

where
s = +1 (positive numbers) when the sign bit is 0
s = −1 (negative numbers) when the sign bit is 1
e = Exp − 127 (in other words, the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, 1 ≤ m < 2.

For example, in the bit pattern 0 01111100 01000000000000000000000 (sign bit, Exp, fraction), the sign is zero, the exponent is −3 (Exp = 124), and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore $+1.25 \times 2^{-3}$, which is +0.15625.

Notes:
# Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127: the fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)
# −126 is the smallest exponent for a normalized number.
# There are two zeroes, +0 (sign bit is 0) and −0 (sign bit is 1).
# There are two infinities, +∞ (sign bit is 0) and −∞ (sign bit is 1).
# NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish ''signaling NaNs'' from ''quiet NaNs''.
# NaNs and infinities have all 1s in the Exp field.
# The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the fraction field) are
#: $\pm 2^{-149} \approx \pm 1.4012985 \times 10^{-45}$
# The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are
#: $\pm 2^{-126} \approx \pm 1.175494351 \times 10^{-38}$
# The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are
#: $\pm (2 - 2^{-23}) \times 2^{127} \approx \pm 3.4028235 \times 10^{38}$
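The value formula and the notes above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper names `decode_single` and `bits_of` are invented for this example, and the result is held in a Python float (an IEEE double on typical platforms), which can represent every single-precision value exactly.

```python
import struct


def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern using v = s * 2^e * m."""
    sign_bit = bits >> 31
    exp = (bits >> 23) & 0xFF    # stored (biased) exponent, the Exp field
    fraction = bits & 0x7FFFFF   # 23 fraction bits
    s = -1.0 if sign_bit else 1.0
    if exp == 255:               # all 1s in Exp: infinity or NaN
        return s * float("inf") if fraction == 0 else float("nan")
    if exp == 0:                 # zero or denormalized: e = -126, m = 0.fraction
        e, m = -126, fraction / 2**23
    else:                        # normalized: e = Exp - 127, m = 1.fraction
        e, m = exp - 127, 1 + fraction / 2**23
    return s * 2.0**e * m


def bits_of(x: float) -> int:
    """Return the single-precision bit pattern of x, for cross-checking."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(decode_single(0x3E200000))  # 0.15625
print(hex(bits_of(0.15625)))      # 0x3e200000
```

The bit pattern 0x3E200000 corresponds to sign 0, Exp = 124 and fraction bits 0100...0, which is the value +1.25 × 2^−3 = +0.15625 worked through in the text.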