Paul McLellan
Tensilica Floating Point: Small, Similar Cycles and Lower Power

5 Oct 2016 • 6 minute read

When I started programming, the first language I learned was Fortran IV. I was 14 years old, and learning to program at that age was rare then, since the only computers that existed were mainframes. This was before the minicomputer, let alone the various "personal computers." Fortran was one of the few programming languages that existed (before C or Pascal, for example), so it was what introductory programming courses tended to use. This was also the era of the punched card: interactive terminals did not yet exist, let alone graphic displays.

punched card

Since Fortran was targeted at numerical work (Fortran stands for FORmula TRANslation), the exercises tended to be in that domain. You can use Fortran for other things (I wrote a program to solve mate-in-two chess problems, for example) but when you start, the easiest exercises are things like programming the iterative Newton-Raphson algorithm to find polynomial roots. So immediately you are plunged into using floating point (REAL in Fortran-speak).
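For the curious, here is a minimal sketch of that classic exercise, written in C rather than Fortran; the polynomial and the starting guess are illustrative choices of mine, not from any particular coursework.

    #include <math.h>
    #include <stdio.h>

    /* Example polynomial f(x) = x^3 - 2x - 5 and its derivative. */
    static double f(double x)  { return x * x * x - 2.0 * x - 5.0; }
    static double df(double x) { return 3.0 * x * x - 2.0; }

    int main(void) {
        double x = 2.0;                       /* initial guess */
        for (int i = 0; i < 20; i++) {
            double step = f(x) / df(x);       /* Newton-Raphson update */
            x -= step;
            if (fabs(step) < 1e-12) break;    /* "close enough" test, not == */
        }
        printf("root ~= %.12f\n", x);         /* ~2.094551481542 */
        return 0;
    }

Note the convergence test: "is the step smaller than a tolerance?", not "is it equal to zero?", which foreshadows the first gotcha below.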

Talking of Newton-Raphson, here is a geeky aside. Fact 1: Fortran allows variables to be declared implicitly, just by mentioning them unexpectedly. If the name begins with I through N, it is an integer; otherwise it is floating point. Fact 2: Fortran's loop construct is the do-loop, which starts with a statement like "do 123 k = 1,3" (actually it would all be upper case, since this predated lower case on computers too), meaning execute the instructions down to label 123 with k taking the values 1, then 2, then 3. Put these facts together and here's how NASA reportedly lost a spacecraft to a single-character error. A programmer typed "do 123 k = 1.3" (with a period instead of a comma). The compiler interpreted this, not as a mistyped do-loop, but (correctly) as declaring a floating-point variable "do123k" (spaces were ignored in Fortran), assigning it the value 1.3, and never using it again. The effect was that instead of going around the loop 3 times, calculating an accurate value, the loop wasn't really a loop at all, and the instructions were executed just once, getting a rough approximation. Too rough, as it turned out. Implicit declaration of variables has been considered a "bad thing" in subsequent programming languages. End of geeky aside.

Floating point is very easy to use since the numbers can be almost arbitrarily large or arbitrarily small. There are only a couple of gotchas that you learn early on. The first is that you can't really compare two floating-point numbers for equality. In all except the simplest calculations, two numbers will never be exactly equal and instead you need to make a decision as to how close they need to be and then compare whether the difference is smaller than that. The second gotcha is that adding a tiny number to a large number may be equivalent to doing nothing. This is because of the way floating point works under the hood. Numbers are represented as a mantissa and an exponent, but if the exponents are too different, then when the smaller one is converted to the larger exponent, the mantissa gets shifted so much that it is zeroed out.
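Both gotchas are easy to demonstrate in a few lines of C. The tolerance 1e-9 and the magnitude 1.0e20 below are arbitrary illustrative choices.

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Gotcha 1: never compare floats for exact equality. */
        double a = 0.1 + 0.2;              /* actually 0.30000000000000004 */
        printf("a == 0.3?          %d\n", a == 0.3);             /* 0: false */
        printf("|a - 0.3| < tol?   %d\n", fabs(a - 0.3) < 1e-9); /* 1: true  */

        /* Gotcha 2: adding a tiny number to a huge one does nothing,
           because the small mantissa is shifted out during alignment. */
        double big = 1.0e20;
        printf("big + 1.0 == big?  %d\n", big + 1.0 == big);     /* 1: true  */
        return 0;
    }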

Other than those two gotchas (and some less significant ones), programming in floating point is straightforward since you don't have to worry about the numbers getting too big. DSP algorithms, however, have historically been programmed in fixed point. This is really programming with integers, while keeping track of which bits of each integer form the integer part and which form a binary fraction. It is the programmer's job to make sure that the numbers don't get too big and overflow.
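To make that concrete, here is a minimal C sketch of Q15 arithmetic, one common fixed-point convention: 16-bit values with 15 fraction bits. The helper names are mine, for illustration only.

    #include <stdint.h>
    #include <stdio.h>

    /* Q15: a 16-bit integer whose low 15 bits are a binary fraction,
       representing values in the range [-1.0, +1.0). */
    static int16_t q15_from_float(float x) { return (int16_t)(x * 32768.0f); }
    static float   q15_to_float(int16_t q) { return q / 32768.0f; }

    /* The 32-bit product of two Q15 values is Q30, so shift right by 15
       to return to Q15. Guarding against overflow (saturating, pre-scaling)
       is left to the programmer, exactly as described above. */
    static int16_t q15_mul(int16_t a, int16_t b) {
        int32_t p = (int32_t)a * (int32_t)b;   /* Q30 intermediate */
        return (int16_t)(p >> 15);             /* back to Q15 */
    }

    int main(void) {
        int16_t a = q15_from_float(0.5f);      /* 16384 */
        int16_t b = q15_from_float(0.25f);     /*  8192 */
        printf("0.5 * 0.25 = %f\n", q15_to_float(q15_mul(a, b)));  /* 0.125 */
        return 0;
    }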

One other gotcha was that, prior to 1985, different manufacturers implemented the details of floating point differently (in particular, how computations were rounded). The IEEE 754 standard defined all the details, and soon after, all floating-point units (FPUs) were IEEE 754 compliant and would produce exactly the same results for the same computation.

In practice, the way DSP is done today is largely:

  1. The signal-processing expert develops the algorithms in floating point in MATLAB (from The MathWorks)
  2. The implementation programmers then translate the algorithm into fixed point for the DSP (or sometimes even RTL), as sketched below
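To illustrate that hand-off, here is a small, hypothetical C example: a three-tap filter written first in floating point (as the algorithm might arrive from MATLAB) and then translated by hand into Q15 fixed point. The taps and sample data are made up.

    #include <stdint.h>
    #include <stdio.h>

    /* Floating-point reference, as the algorithm might arrive from MATLAB. */
    static float fir3_float(const float c[3], const float x[3]) {
        return c[0] * x[0] + c[1] * x[1] + c[2] * x[2];
    }

    /* Hand-translated Q15 version for a fixed-point DSP. The 32-bit
       accumulator holds Q30 products; one shift converts back to Q15.
       Choosing the accumulator width and the shift is the manual work
       the implementation programmer does. */
    static int16_t fir3_q15(const int16_t c[3], const int16_t x[3]) {
        int32_t acc = 0;
        for (int i = 0; i < 3; i++)
            acc += (int32_t)c[i] * (int32_t)x[i];   /* Q30 accumulate */
        return (int16_t)(acc >> 15);                /* Q30 -> Q15 */
    }

    int main(void) {
        const float   cf[3] = { 0.25f, 0.50f, 0.25f };  /* float taps       */
        const int16_t cq[3] = { 8192, 16384, 8192 };    /* same taps in Q15 */
        const float   xf[3] = { 0.1f, 0.2f, 0.3f };     /* float samples    */
        const int16_t xq[3] = { 3277, 6554, 9830 };     /* ~same in Q15     */
        printf("float: %f  fixed: %f\n",
               fir3_float(cf, xf), fir3_q15(cq, xq) / 32768.0f);
        return 0;
    }

Both versions print approximately 0.2; the small discrepancy in the fixed-point result is exactly the quantization error the implementation programmer has to budget for.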

So why not use floating point in the DSP? Two reasons. The first is that until recently, DSPs didn't support floating point so it wasn't an option. The second is that it was slow (in terms of the number of clock cycles required) and power hungry (for the same reason). So it was worth biting the bullet and working out the details of how to move the algorithm into a fixed-point implementation. These two things went together: since FPUs were big, slow, and needed too much power, DSPs didn't have them and programmers had no choice but to use fixed point. But modern DSPs now have FPUs. Should you use them?

At the recent Linley Processor Conference, Cadence's Dror Maydan gave a presentation titled "As Embedded Floating Point Becomes Ubiquitous, What Are Your Options?" At the top level, the options are not that complicated: fixed point or floating point.

It turns out that floating point gets a bad rap in modern implementations. Instead of being big, slow, and power-hungry, floating point turns out to be small, with similar cycle counts, and lower power.

First, the area penalty. On the Tensilica Fusion G3, Vision P5, and HiFi 3 (the latest versions of the cores), adding floating point costs only 10-15% in area. In MACs per cycle, the numbers are the same for fixed and floating point. And on power, floating point comes out ahead: the HiFi 3 audio DSP is 30% lower and the Fusion G3 is 15% lower.

Cadence recently announced a new processor, the Tensilica Xtensa LX7. Despite the number, this is actually the 12th generation of the Tensilica Xtensa base processor architecture. It increases floating-point throughput from 2 to as many as 64 FLOPS per cycle. It is also the brains inside the latest Vision P6, BBE64EP ConnX DSP, and the Fusion G3, all of which were previously (but recently) announced.

At the recent Hot Chips conference in Cupertino, Microsoft gave details of its HoloLens HPU (holographic processing unit), a 28nm chip containing 24 Tensilica DSP cores along with 8MB of cache and 65M additional logic gates. One reason Microsoft selected Tensilica was flexibility: it added 300 custom instructions to the cores using the Tensilica Instruction Extension (TIE). Nick Baker, a distinguished technologist at Microsoft, said during the talk that this achieves a speedup of over 200X compared to a software-only version.

The raw performance of the Xtensa LX7, and of the processors built on top of it, is high. See the table below for some data points.

[Table: Tensilica performance]

Watch a "Whiteboard Wednesday" on the increased need for floating point presented by Pushkar Patwardhan, a DSP architect at Cadence.

Read more about the Tensilica cores, including the Xtensa LX7.
