Primary


Floating-Point Number (Float) ○꠹｜Definition｜1st｜20251122125539-00-⌔
Floating-point arithmetic - Wikipedia#Floating-point_numbers

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits.

There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit “point” character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby “00012345” would represent 0001.2345.
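The fixed-point example above can be checked with a few lines of Python. This is only a sketch: the helper name `decode_fixed_point` and its parameters are hypothetical, chosen for illustration, and `Fraction` is used so the result stays exact.

```python
from fractions import Fraction


def decode_fixed_point(digits: str, fraction_digits: int, base: int = 10) -> Fraction:
    """Read a digit string whose radix point sits `fraction_digits` places from the right.

    Hypothetical helper for illustration; the name is not taken from the source text.
    """
    # The digit string is an integer scaled down by base**fraction_digits.
    return Fraction(int(digits, base), base ** fraction_digits)


# Eight decimal digits with the decimal point in the middle (four fractional digits).
value = decode_fixed_point("00012345", fraction_digits=4)
print(float(value))  # 1.2345
```

Because the position of the radix point is fixed by the scheme, only the digit string itself has to be stored.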

In scientific notation, the given number is scaled by a power of 10 so that it lies within a specific range, typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, expressed as a power of ten, is then indicated separately at the end of the number. For example, the orbital period of Jupiter’s moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as $1.528535047 \times 10^{5}$ seconds.
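As a rough illustration, Python's `decimal` module can split the Io example into its digit string and its power-of-ten scaling. The function name `to_scientific` is an assumption made for this sketch.

```python
from decimal import Decimal


def to_scientific(text: str) -> tuple[str, int]:
    """Return (digit string, power of ten) for a decimal literal.

    Hypothetical helper for illustration only.
    """
    number = Decimal(text)
    digits = "".join(str(d) for d in number.as_tuple().digits)
    # adjusted() is the exponent once only the leading digit is left of the point.
    return digits, number.adjusted()


digits, exponent = to_scientific("152853.5047")
print(digits, exponent)  # 1528535047 5
print(f"{digits[0]}.{digits[1:]} x 10^{exponent}")  # 1.528535047 x 10^5
```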

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

A signed (meaning positive or negative) digit string of a given length in a given radix (or base). This digit string is referred to as the significand, mantissa, or coefficient.¹ The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.

A signed integer exponent (also referred to as the characteristic, or scale),² which modifies the magnitude of the number.

To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1528535047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by $10^{5}$ to give $1.528535047 \times 10^{5}$, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.

Symbolically, this final value is:

$\frac{s}{b ^{p - 1}} \times b^{e},$

where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base (in our example, this is the number ten), and e is the exponent.
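The formula can be evaluated directly. The sketch below uses exact rational arithmetic; the function name `float_value` is hypothetical, while the parameter names follow the symbols s, p, b and e defined above.

```python
from fractions import Fraction


def float_value(s: int, p: int, b: int, e: int) -> Fraction:
    """Evaluate s / b**(p - 1) * b**e exactly.

    s: significand as an integer (implied radix point ignored)
    p: precision (number of digits in the significand)
    b: base
    e: exponent (may be negative)
    """
    return Fraction(s, b ** (p - 1)) * Fraction(b) ** e


# The decimal example from the text: significand 1528535047, exponent 5.
io_period = float_value(s=1528535047, p=10, b=10, e=5)
print(float(io_period))  # 152853.5047
```

Dividing by $b^{p-1}$ is what places the radix point just after the most significant digit, matching the convention used in this article.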

Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (decimal floating point), and other less common varieties, such as base sixteen (hexadecimal floating point³,⁴,⁵), base eight (octal floating point⁶,⁴,⁷,³,⁸), base four (quaternary floating point⁹,⁴,¹⁰), base three (balanced ternary floating point⁶) and even base 256⁴,¹¹ and base 65,536.¹²,¹³

A floating-point number is a rational number, because it can be represented as one integer divided by another; for example $1.45 \times 10^{3}$ is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or $2 \times 10^{-1}$). However, 1/3 cannot be represented exactly by either binary (0.010101…) or decimal (0.333…), but in base 3, it is trivial (0.1 or $1 \times 3^{-1}$). The occasions on which infinite expansions occur depend on the base and its prime factors.
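The dependence on the prime factors of the base can be sketched as a small test: a fraction in lowest terms has a finite expansion in a base exactly when every prime factor of its denominator divides that base. The helper name `terminates_in_base` is an assumption made for this example.

```python
from fractions import Fraction
from math import gcd


def terminates_in_base(x: Fraction, base: int) -> bool:
    """True if x has a finite digit expansion in `base`.

    Hypothetical helper: strips from the denominator every factor it
    shares with the base and checks whether anything is left over.
    """
    d = x.denominator  # Fraction keeps itself in lowest terms
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    return d == 1


print(terminates_in_base(Fraction(1, 5), 2))   # False
print(terminates_in_base(Fraction(1, 5), 10))  # True
print(terminates_in_base(Fraction(1, 3), 10))  # False
print(terminates_in_base(Fraction(1, 3), 3))   # True

# A binary float can only hold a nearby rational with a power-of-two denominator.
stored = Fraction(0.2)
print(stored == Fraction(1, 5))  # False
```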

The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, $p = 24$, and so the significand is a string of 24 bits. For instance, the first 33 bits of the number π are:

$11001001\ 00001111\ 1101101\underline{0}\ 10100010\ 0.$

In this binary expansion, let us denote the positions from 0 (the leftmost, or most significant, bit) to 32 (the rightmost bit). The 24-bit significand stops at position 23, shown as the underlined bit 0 above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, but this is not such a case). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:

$11001001\ 00001111\ 1101101\underline{1}.$

When stored in memory using the IEEE 754 encoding, this value becomes the significand s. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left to right as follows:

$\left(\sum_{n=0}^{p-1} \text{bit}_{n} \times 2^{-n}\right) \times 2^{e} = \left(1 \times 2^{-0} + 1 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + \dots + 1 \times 2^{-23}\right) \times 2^{1} \approx 1.57079637 \times 2 \approx 3.1415927$

where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).
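The rounded significand and the sum can be reproduced with Python's standard `struct` module, which can round a value to IEEE 754 single precision. This is a sketch, and the helper name `float32_fields` is hypothetical. Note that the binary32 encoding stores only 23 of the 24 significand bits, so the code restores the leading 1 before printing.

```python
import math
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Round x to IEEE 754 binary32; return (sign, biased exponent, 23-bit fraction field).

    Hypothetical helper for illustration only.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


sign, biased_exponent, fraction = float32_fields(math.pi)
significand = (1 << 23) | fraction  # restore the leading 1 to get all 24 bits
e = biased_exponent - 127           # binary32 exponents are stored with a bias of 127

print(f"{significand:024b}")  # 110010010000111111011011
print(e)                      # 1

# Sum bit_n * 2**-n for n = 0..23, then scale by 2**e, as in the formula above.
p = 24
value = sum(((significand >> (p - 1 - n)) & 1) * 2.0 ** -n for n in range(p)) * 2.0 ** e
print(value)  # 3.1415927410125732
```

The printed bit string matches the rounded 24-bit significand shown earlier, and the final value agrees with 3.1415927 to the digits given in the text.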

It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention,⁶ or the assumed bit convention.
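A minimal sketch of binary normalization follows, assuming a positive significand held as an integer of at most p bits and ignoring the minimum-exponent exception mentioned above. The function names `normalize` and `stored_fraction` are hypothetical.

```python
def normalize(s: int, e: int, p: int = 24) -> tuple[int, int]:
    """Shift a positive significand left until its top (p-th) bit is 1.

    Each left shift doubles s, so e is decremented to keep
    s / 2**(p - 1) * 2**e unchanged. Assumes 0 < s < 2**p and does not
    model the minimum exponent.
    """
    if not 0 < s < (1 << p):
        raise ValueError("significand must be positive and fit in p bits")
    while s.bit_length() < p:
        s <<= 1
        e -= 1
    return s, e


def stored_fraction(s: int, p: int = 24) -> int:
    """Drop the leading 1 of a normalized significand, keeping the p - 1 bits that need storing."""
    return s & ((1 << (p - 1)) - 1)


s, e = normalize(0b11, 0)             # value 3 * 2**-23 before and after
print(f"{s:024b}", e)                 # 110000000000000000000000 -22
print(f"{stored_fraction(s):023b}")   # 10000000000000000000000
```

Since the leading bit of every normalized binary significand is 1, only the remaining p − 1 bits carry information, which is where the extra bit of precision comes from.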

Printed 2026-06-28.


Footnotes

1. The significand of a floating-point number is also called mantissa by some authors (not to be confused with the mantissa of a logarithm). Somewhat vaguely, terms such as coefficient or argument are also used by some. The usage of the term fraction by some authors is potentially misleading as well. The term characteristic (as used e.g. by CDC) is ambiguous, as it was historically also used to specify some form of exponent of floating-point numbers.

2. The exponent of a floating-point number is sometimes also referred to as scale. The term characteristic (for biased exponent, exponent bias, or excess n representation) is ambiguous, as it was historically also used to specify the significand of floating-point numbers.

3. Zehendner, Eberhard (Summer 2008). “Rechnerarithmetik: Fest- und Gleitkommasysteme” (PDF) (Lecture script) (in German). Friedrich-Schiller-Universität Jena. p. 2. Archived (PDF) from the original on 2018-08-07. Retrieved 2018-08-07. (NB. This reference incorrectly gives the MANIAC II’s floating point base as 256, whereas it actually is 65536.)

4. Beebe, Nelson H. F. (2017-08-22). “Chapter H. Historical floating-point architectures”. The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library (1st ed.). Salt Lake City, UT, USA: Springer International Publishing AG. p. 948. doi:10.1007/978-3-319-64110-2. ISBN 978-3-319-64109-6. LCCN 2017947446. S2CID 30244721.

5. Hexadecimal (base-16) floating-point arithmetic is used in the IBM System 360 (1964) and 370 (1970) as well as various newer IBM machines, in the RCA Spectra 70 (1964), the Siemens 4004 (1965), 7.700 (1974), 7.800, 7.500 (1977) series mainframes and successors, the Unidata 7.000 series mainframes, the Manchester MU5 (1972), the HEP (1982) computers, and in 360/370-compatible mainframe families made by Fujitsu, Amdahl and Hitachi. It is also used in the Illinois ILLIAC III (1966), Data General Eclipse S/200 (ca. 1974), Gould Powernode 9080 (1980s), Interdata 8/32 (1970s), the SEL Systems 85 and 86 as well as the SDS Sigma 5 (1967), 7 (1966) and Xerox Sigma 9 (1970).

6. Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1st ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.

7. Savard, John J. G. (2018) [2007], “The Decimal Floating-Point Standard”, quadibloc, archived from the original on 2018-07-03, retrieved 2018-07-16.

8. Octal (base-8) floating-point arithmetic is used in the Ferranti Atlas (1962), Burroughs B5500 (1964), Burroughs B5700 (1971), Burroughs B6700 (1971) and Burroughs B7700 (1972) computers.

9. Parkinson, Roger (2000-12-07). “Chapter 2 - High resolution digital site survey systems - Chapter 2.1 - Digital field recording systems”. High Resolution Site Surveys (1st ed.). CRC Press. p. 24. ISBN 978-0-20318604-6. Retrieved 2019-08-18. […] Systems such as the [Digital Field System] DFS IV and DFS V were quaternary floating-point systems and used gain steps of 12 dB. […] (256 pages)

10. Quaternary (base-4) floating-point arithmetic is used in the Illinois ILLIAC II (1962) computer. It is also used in the Digital Field System DFS IV and V high-resolution site survey systems.

11. Base-256 floating-point arithmetic is used in the Rice Institute R1 computer (since 1958).

12. Lazarus, Roger B. (1957-01-30) [1956-10-01]. “MANIAC II” (PDF). Los Alamos, NM, USA: Los Alamos Scientific Laboratory of the University of California. p. 14. LA-2083. Archived (PDF) from the original on 2018-08-07. Retrieved 2018-08-07. […] the Maniac’s floating base, which is $2^{16}$ = 65,536. […] The Maniac’s large base permits a considerable increase in the speed of floating point arithmetic. Although such a large base implies the possibility of as many as 15 lead zeros, the large word size of 48 bits guarantees adequate significance. […]

13. Base-65536 floating-point arithmetic is used in the MANIAC II (1956) computer.

