Floating-point numbers are widely used for numerical calculations, including digital signal processing. In fact, we recently announced a family of floating-point optimized Tensilica DSPs. See my post Tensilica Floating-Point DSP Family. Prior to 1985, floating-point implementations on different computer systems could give different answers due to subtle differences in rounding, support for infinity, and other implementation details. In 1985, the IEEE issued standard 754, which defines all these things, including the precise format of floating-point numbers. The current version of the standard supports both binary (most common) and decimal formats; I'm only going to talk about binary formats. The standard defines everything in enough detail that you will get the same result on any conforming floating-point unit.
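As a minimal illustration of that precise format (a sketch of my own, not anything from the standard's text), here is a short C program that pulls apart the bit pattern of a single-precision number. The field widths (1 sign bit, 8 exponent bits with a bias of 127, 23 stored mantissa bits) are as IEEE 754 defines them; the rest is just demonstration.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    /* IEEE 754 single precision: 1 sign bit, 8 exponent bits (bias 127),
       23 stored mantissa bits (the leading 1 of a normalized number is
       implied, not stored). */
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* portable way to view the bit pattern */

    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF;
    unsigned mantissa = bits & 0x7FFFFF;

    printf("value    = %g\n", f);
    printf("sign     = %u\n", sign);                 /* 1 */
    printf("exponent = %u (unbiased %d)\n",
           exponent, (int)exponent - 127);           /* 129 (2) */
    printf("mantissa = 0x%06X\n", mantissa);         /* 0x480000 */
    return 0;
}
```

For -6.25 = -1.1001₂ × 2², the sign bit is 1, the biased exponent is 129, and the mantissa bits encode the .1001 fraction with the leading 1 implied.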
The advantage of floating-point numbers compared to integers is the much larger dynamic range of values that can be represented. There is no free lunch: this range is paid for with reduced precision once the number gets too large for an exact representation in the mantissa. Compared to fixed-point numbers, the big advantage of floating-point representations is that fractions are represented directly and the programmer does not have to keep track of where the "binary point" is all the time. Some of these advantages are similar to the advantages of using "scientific notation", numbers like 6.02×10²³ (or 6.02e+23), which would be pretty tedious to write out in full (plus, I don't think we know Avogadro's Number to that many digits of precision).
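You can see the precision cost directly with a short C sketch of my own. Single precision has a 24-bit significand (23 stored bits plus the hidden leading 1), so above 2²⁴ = 16,777,216 consecutive integers can no longer all be represented:

```c
#include <stdio.h>

int main(void) {
    /* Above 2^24 the spacing between representable single-precision
       values becomes 2, so adding 1 can simply round away. */
    float big = 16777216.0f;   /* 2^24, still exactly representable */
    float r1 = big + 1.0f;     /* rounds back to 2^24: the +1 is lost */
    float r2 = big + 2.0f;     /* 2^24 + 2 is the next representable value */
    printf("%.1f\n", r1);      /* 16777216.0 */
    printf("%.1f\n", r2);      /* 16777218.0 */
    return 0;
}
```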
Floating-point numbers consist of an exponent and a mantissa. The number of bits allocated to the exponent and the mantissa varies depending on the precision. There is a sign bit, but also a little cheat: the mantissa of a normalized number always starts with a 1, so that bit doesn't need to be stored. Since the sign is explicit, there are both positive and negative floating-point zeros (which we don't get with two's-complement integers, where we have a different anomaly: the smallest negative number cannot be negated because the result is too large to be represented as a positive number).
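Both quirks are easy to demonstrate. In this minimal C sketch (mine, for illustration), the two zeros compare equal but carry different sign bits, while INT_MIN has no positive counterpart:

```c
#include <stdio.h>
#include <limits.h>
#include <math.h>

int main(void) {
    /* Floating point: +0.0 and -0.0 compare equal, but their sign
       bits differ, which signbit() can distinguish. */
    double pz = 0.0, nz = -0.0;
    printf("pz == nz   : %d\n", pz == nz);          /* 1 (true) */
    printf("signbit(pz): %d\n", signbit(pz) != 0);  /* 0 */
    printf("signbit(nz): %d\n", signbit(nz) != 0);  /* 1 */

    /* Two's complement: INT_MIN cannot be negated, since -INT_MIN
       would exceed INT_MAX (and overflow is undefined behavior in C). */
    printf("INT_MIN = %d, INT_MAX = %d\n", INT_MIN, INT_MAX);
    return 0;
}
```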
One of the compulsory courses when I studied computer science was Numerical Methods: things like finding roots of equations with Newton-Raphson or manipulating matrices with techniques like successive over-relaxation. This was all done with floating-point arithmetic. The lecturer was actually Maurice Wilkes, the head of the Computer Laboratory and famous for leading the team that designed EDSAC. I also remember him being challenged that computers couldn't really get any faster due to speed-of-light considerations (this was an era when the memory might be in a box on the other side of the room from the mainframe CPU). He thought for a moment and said, "I think computers are going to get a lot smaller". With microprocessors, of course, they have.
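For the flavor of it, here is a hypothetical Newton-Raphson sketch in C (the function name and tolerance are mine, not anything from the course): iterating x ← (x + a/x)/2 roughly doubles the number of correct digits of √a per step.

```c
#include <stdio.h>
#include <math.h>

/* Newton-Raphson for the root of f(x) = x*x - a, i.e. sqrt(a). */
static double nr_sqrt(double a) {
    double x = a > 1.0 ? a : 1.0;          /* crude initial guess */
    for (int i = 0; i < 60; i++) {
        double next = 0.5 * (x + a / x);   /* one Newton step */
        if (fabs(next - x) <= 1e-15 * x)   /* converged to ~machine precision */
            return next;
        x = next;
    }
    return x;
}

int main(void) {
    printf("nr_sqrt(2) = %.17g\n", nr_sqrt(2.0));  /* 1.4142135623730951 */
    printf("sqrt(2)    = %.17g\n", sqrt(2.0));     /* same, from libm */
    return 0;
}
```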
I'm amazed to discover that the textbook we used, Forman Acton's Numerical Methods That Work, first published in 1970, has been reissued and you can get it on Amazon, although I remember it had a bright yellow cover back in the day.
Floating-point numbers often exhibit unexpected behavior that is a trap for the unwary:
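For example (a minimal C sketch of my own): decimal 0.1 and 0.2 have no exact binary representation, and rounding makes addition non-associative.

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* 0.1 and 0.2 are not exactly representable in binary, so the
       rounded sum is not exactly 0.3. */
    printf("0.1 + 0.2 == 0.3 ? %d\n", 0.1 + 0.2 == 0.3);  /* 0 (false) */
    printf("0.1 + 0.2 = %.17g\n", 0.1 + 0.2);             /* 0.30000000000000004 */

    /* Rounding also breaks associativity: grouping changes the result. */
    double h = DBL_EPSILON / 2;   /* half a ulp at 1.0 */
    double t = 1.0 + h;           /* rounds back down to exactly 1.0 */
    double left  = t + h;         /* rounds to 1.0 again */
    double right = 1.0 + (h + h); /* h + h is exact, and this add sticks */
    printf("left == right ? %d\n", left == right);        /* 0 (false) */
    return 0;
}
```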
And, of course, there's always an XKCD for everything.