Episode 3.07 – Introduction to Floating Point Binary and IEEE 754 Notation

Welcome to the Geek Author series on Computer Organization and Design Fundamentals. I’m David Tarnoff, and in this series we are working our way through the topics of Computer Organization, Computer Architecture, Digital Design, and Embedded System Design. If you’re interested in the inner workings of a computer, then you’re in the right place. The only background you’ll need for this series is an understanding of integer math, and if possible, a little experience with a programming language such as Java. And one more thing. Our topics involve a bit of figuring, so it might help to keep a pencil and paper handy.

In our last episode, we discussed how the digits of fixed-point binary numbers represent different powers of two with the power descending by one for each digit position that we move from left to right. A point is added to separate the positions with non-negative exponents from the positions with negative exponents.

Sometimes, however, the numbers we’re trying to represent are too large or too small to be represented with fixed-point notation. For example, it isn’t practical to write decimal numbers such as the Avagadro number, six zero two followed by twenty-one zeros, or the width of a hydrogen atom in meters, 0.000000000106. That’s why we have scientific notation.

In scientific notation, the decimal value -342,000,000 is represented as -3.42 x 108. Note that there are three components to this representation: the sign, the mantissa (the part that represents the significant digits of the number), and an integer exponent. The exponent represents the number of times we multiply or divide the mantissa by the base. In proper scientific notation, the integer portion of the mantissa must be greater than or equal to one and less than ten, i.e., a single non-zero integer digit.

Scientific notation can be applied to any base, including binary. Changing the base not only changes the numerals available to represent the mantissa, it also changes the “times” part that comes after the mantissa. Remember that when you multiply or divide a number by its base, it has the effect of moving the point right or left. That means the exponent simply represents the number of positions right or left that the point must be moved. An exponent of five means that the point must be moved five positions to the right. An exponent of negative three means that the point must be moved three positions to the left.

Most programming languages require variables to be declared with a type before using them. In Java, for example, a variable declared as an int is stored in 32-bit twos complement representation. For a variable meant to accommodate a real number, programming languages such as Java declare them using keywords like float or double. For these formats, computers rely on scientific notation in base-2.

Just like decimal, scientific notation in binary has a sign identifying the number as positive or negative, the mantissa with a single non-zero digit to the left of the binary point and the fraction to the right, and an exponent that in this case uses two as the multiplier to move the binary point right or left. For example, the binary number 10001.011 would be represented in base-2 scientific notation as 1.0001011 x 24. Note that in binary, there is only one option for the single non-zero digit to the left of the binary point. It must be a one.

In order to reliably share floating-point binary data between devices, a standard was created, the IEEE 754 Standard for Floating-Point Arithmetic. This standard is used to represent real numbers on the majority of contemporary computer systems. The two most common formats are a 32-bit pattern to represent single-precision numbers and a 64-bit pattern to represent double-precision numbers. Both of these formats partition the binary pattern into three parts with each part representing a different scientific notation component. The order and use of each partition is the same in both formats. Only the number of bits used to represent each part varies.

The first or left-most bit of both formats is used to represent the sign of the number. A zero in this position means the number is positive while a one means the number is negative. The next group of bits, which we will call E, represents the exponent in offset notation. For 32-bit single-precision notation, this exponent is eight bits long with a bias of 127 while 64-bit double-precision notation uses an eleven-bit exponent with a bias of 1023.

The third and right-most group of bits represents the fraction from which the mantissa is generated. Single-precision uses twenty-three bits here while double-precision uses fifty-two. Here we will represent the fraction with the letter F. There can be some confusion here. Remember that the integer part of any mantissa has a lower limit of one and an upper limit of one less than the base. In binary, this integer portion must always be one. Because of this, the integer portion of the mantissa in binary in not stored as part of the binary pattern, only the fraction is.

To convert from IEEE 754 notation to binary scientific notation, we first partition the 32- or 64-bit number into its three components, then fill in the blanks of the corresponding scientific notation format. For single-precision numbers, the equation including the operation to take care of the offset or bias is ±1.F x 2(E – 127). For double-precision numbers, the equation including the operation to take care of the offset or bias is ±1.F x 2(E – 1023). In both cases, F is preceded with the implied ‘1’ followed by a binary point.

Let’s convert the 32-bit single-precision IEEE Standard 754 pattern 1 10000101 01000000000000000000000 to scientific notation. First, we partition the 32-bit number into its components. The most significant bit is a one, which means that this is a negative number. The eight bits following the sign bit represent the exponent, E. In this example, E equals 10001001, which in decimal equals 128 + 4 + 1 or 133. Remember that this is in offset notation, so we subtract 127 from 133 to get the exponent six.

The remaining 23 bits are the fraction of the mantissa. They are 01 followed by twenty-one trailing zeros. That gives us our value in scientific notation: -1.01 x 26. Note that the trailing zeros have been discarded. To determine the value this represents in fixed-point notation, we move the binary point six positions to the right to get -1010000. In decimal, -1010000 equals -80.0.

Now let’s go the other way. Create the 32-bit single-precision IEEE 754 representation of the binary number +0.0001011. The first step is to put this number into scientific notation with a single 1 to the left of the binary point. Note that this is done by moving the binary point four positions to the right to get +1.011 x 2-4. This is a positive number, so the sign bit is zero. Next, the exponent is –4. Since we’re storing this in offset notation with a bias of 127, add 127 to -4 to get 123. One hundred and twenty-three in 8-bit binary is 01111011.

For the fraction, we only store the digits following the binary point. In this case, the digits to the right of the binary point are 011. We add twenty trailing zeros to the right of 011 since the portion of IEEE 754 single-precision notation that stores the fraction requires exactly twenty-three bits. Putting these patterns in the order of sign bit followed by 8-bit exponent followed by 23-bit fraction gives us the IEEE 754 single-precision representation: 0 01111011 01100000000000000000000.

IEEE 754 defines a few special cases. First, you may have noticed that it is impossible to represent zero since the integer part of the mantissa is assumed always to be one. Zero is represented by setting both E and F to zero. The sign bit is ignored. IEEE 754 also allows us to represent infinite. Infinite is represented by setting E to all ones (255), and clearing F to all zeros. The sign bit distinguishes positive infinite from negative infinite. Lastly, if E is set to all ones and F is set to a non-zero value, we have what is called NaN or Not a Number. This means that the value we’ve attempted to store is not representable. For example, an attempt to take the square root of a negative floating-point number should result in an NaN value.

In earlier episodes, we discussed the importance of comparing and sorting values on the computer. The specification of IEEE 754 takes this into account. First, let’s look at the arrangement of the elements. Ignoring sign, what has the greatest effect on relative magnitude? It turns out that the exponent is the most important element. Regardless of the mantissa, larger exponents mean larger values. That’s why the exponent is placed in a position of greater significance than the fraction. If two numbers have the same sign and the same exponent, then the fractions are compared.

This also explains why offset notation is used instead of twos complement for the exponent. In Episode 3.5, “Introduction to Offset or Biased Notation,” we showed how offset notation is easier to sort. Remember that twos complement sets the most significant digit to one for its negative numbers. This makes negative numbers appear larger than positive numbers. Since negative exponents mean smaller values in IEEE 754, offset notation is used for the exponent so that floating-point numbers can be compared using the same hardware used to compare unsigned integers.

That brings us to the end of our discussion regarding floating-point notation. For transcripts, links, or other podcast notes, please check us out at intermation.com where you will also find links to our Instagram, Twitter, Facebook, and Pinterest pages. Until then remember that while the scope of what makes a computer is immense, it’s all just ones and zeros.