The IEEE 754-1985 standard specifies the 32-bit floating point format, which is known as the float type in many languages. The format is much more complicated than the integer formats because it has to encode an enormous range of values in the same 32 bits.
A 32-bit float has three fields, in this order:
- One-bit sign field.
- Eight-bit exponent field.
- Twenty-three-bit mantissa field, which holds the less significant bits of the value.
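The three fields above can be pulled apart with a few shifts and masks. Here is a minimal sketch in Python (the `float_fields` name is mine, not from the article):

```python
import struct

def float_fields(x):
    """Split a 32-bit float into its sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

print(float_fields(1.0))   # (0, 127, 0)
print(float_fields(-2.5))  # (1, 128, 2097152)
```

Round-tripping through `struct.pack`/`struct.unpack` is just one way to get at the raw bits; in C you would typically use `memcpy` into a `uint32_t` instead.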
The author suggests playing with the format using the provided code, which was actually really helpful! Here are a few things that I learned:
- The exponent field is biased by 127: a stored value of 127 means an exponent of 0, so 1.0 stores 127. The field moves up or down as the binary point shifts (2.0 stores 128, 0.5 stores 126).
- The sign & magnitude scheme is used for the sign. As a result, every number (including 0) has both a positive and a negative form.
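Both bullets are easy to verify by inspecting raw bit patterns. A small sketch (the `fields` helper is mine):

```python
import struct

def fields(x):
    """Return (sign, exponent field, mantissa field) of a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# The exponent field is biased by 127: 1.0 stores 127, 2.0 stores 128, 0.5 stores 126.
print(fields(1.0)[1], fields(2.0)[1], fields(0.5)[1])  # 127 128 126

# Sign & magnitude gives two zeros: only the sign bit differs.
print(fields(0.0), fields(-0.0))  # (0, 0, 0) (1, 0, 0)
```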
Outside of the special cases below, the formula for decoding a float is quite simple:
(1.0 + mantissa-field / 0x800000) * 2^(exponent-field - 127)
The sign bit then makes the result positive or negative.
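The formula can be applied by hand and checked against the hardware's answer. A sketch for normal numbers only (the `decode` name is mine):

```python
import struct

def decode(bits):
    """Decode a 32-bit pattern with the formula above (normal numbers only)."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign * (1.0 + mantissa / 0x800000) * 2.0 ** (exponent - 127)

# Pack a float, then decode its raw bits manually; the results should match.
bits = struct.unpack(">I", struct.pack(">f", -13.375))[0]
print(decode(bits))  # -13.375
```

-13.375 is chosen because it is exactly representable (13.375 = 1.671875 × 2^3), so the manual decode reproduces it with no rounding.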
When decoding the format, there is an implied leading 1 in front of the mantissa field whenever the exponent field is between 1 and 254 (unbiased exponents -126 through 127). When the exponent field is 0, do not add the leading one: the exponent is fixed at -126 and the number is subnormal.
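The no-leading-one rule can be folded into the decoder as a sketch (again, the `decode` helper is my own naming, assuming the bias-127 formula from above):

```python
def decode(bits):
    """Decode normal and subnormal 32-bit patterns."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at -126.
        return sign * (mantissa / 0x800000) * 2.0 ** -126
    return sign * (1.0 + mantissa / 0x800000) * 2.0 ** (exponent - 127)

# The smallest positive subnormal is 2^-23 * 2^-126 = 2^-149.
print(decode(0x00000001) == 2.0 ** -149)  # True
```

Python's own floats are 64-bit doubles, which can hold 2^-149 exactly, so the comparison is safe here.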
There are a few special cases that are interesting to talk about. First, if the exponent field is 255 and the mantissa is zero, then the value is infinity (with the sign bit choosing positive or negative infinity). And if the exponent field is 255 and the mantissa is non-zero, then this is Not A Number (NaN).
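These special patterns are easy to confirm by packing `inf` and `nan` and looking at the raw bits (the `bits_of` helper is mine):

```python
import math
import struct

def bits_of(x):
    """Return the raw 32-bit pattern of a float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# Exponent field 255, mantissa 0 -> infinity; the sign bit picks +/-.
print(hex(bits_of(math.inf)))   # 0x7f800000
print(hex(bits_of(-math.inf)))  # 0xff800000

# Exponent field 255, mantissa non-zero -> NaN.
nan_bits = bits_of(math.nan)
print((nan_bits >> 23) & 0xFF, nan_bits & 0x7FFFFF != 0)  # 255 True
```

Note that many different mantissa values all count as NaN, which is why NaNs can carry "payload" bits.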
Why the implicit 1 though? For any normal number, the first bit of the binary pattern will be a 1; otherwise, the exponent would simply be smaller. Leaving it implied saves a bit, which buys an extra bit of precision for free!
This is a huge chain of posts, and I am really excited to learn all about it! Floats have caused issues for developers for many years; I am hyped to finally understand why. More on these articles will be coming!