Paul McLellan
Community Member


Custom Instructions in Tensilica: Wearing a TIE Makes You Smarter

12 Jun 2020 • 5 minute read

Tensilica has a number of different product families targeted at different applications, from audio and video to radar and deep learning. I've written posts about all of these during the last year. The most recent in each domain were:

  • Audio: HiFi DSPs - Not Just for Music Anymore
  • Video: It's a SLAM Dunk Programming the Vision Q7 DSP
  • Radar: Tensilica ConnX B20 for 5G, and Automotive Radar/Lidar
  • Deep Learning: The New Tensilica DNA 100 Deep Neural-network Accelerator

Xtensa

Under the hood, all of these processors are built on top of the Tensilica Xtensa configurable processor architecture. At its core is a VLIW (very long instruction word) DSP, an architecture that allows multiple instructions to be dispatched in a single clock cycle. Unlike in a modern speculative-execution microprocessor, the scheduling is handled at compile time, so there is no need for a lot of hardware to discover instruction-level parallelism on the fly. Since these processors typically execute a single program, or one of just a few, this can all be dealt with in advance.

The Xtensa architecture is configurable in a number of different ways. First, optional blocks, such as a floating-point unit (FPU), can be added...or not. Multipliers of various sizes can be added...or not. The Xtensa system then creates the processor with the selected options. Maybe it is obvious, but it is worth emphasizing that this entails more than just generating the RTL for the processor. The compiler needs to be aware of which options were selected—you can't send floating-point operations to the FPU if there isn't one. Perhaps you want a cycle-accurate instruction set simulator (ISS) model. A test program, for sure. Software debug support. All this happens automatically.

Because the processors I opened this post with are all built on Xtensa, each of those specialized processor families can be further optimized, too. It's not either/or: you don't have to choose between taking a fixed block we designed and rolling your own processor from scratch. The diagram below shows the Xtensa processor pipeline:

There are many reasons to optimize a processor. In fact, learning from all those internet click-bait experts who say that listicles are the way to go, here are 10 reasons:

  1. Future-proof your design, so you can incorporate new algorithms and standards without needing to update RTL
  2. Avoid lengthy RTL verification time, since the Xtensa system provides correct-by-construction functionality
  3. Reduce energy consumption compared to running on your main processor
  4. Get a unique and proprietary processor, making it harder for competitors to copy your design
  5. Use an automated process: build your basic processor and then optimize and accelerate
  6. Avoid I/O bandwidth bottlenecks by bypassing the main bus
  7. Optimize in C, with no need for assembly language
  8. Get better area/performance tradeoffs
  9. Make your design team more productive
  10. Optimize...because you can, with automated tools (and you don't need to be a processor designer)

Tensilica Instruction Extension

Tensilica Instruction Extension, or TIE, is the ultimate in reconfiguration. You can add custom instructions in a way that doesn't break the Xtensa system. Teams designing the most complex systems, from augmented reality to automotive radar, use this approach to get a big boost in performance without requiring a lot more power or area.

To explain what I mean by "not breaking the Xtensa system," I can't do better than to quote from the TIE product page:

Adding TIE instructions to a Tensilica processor core never compromises the underlying base Xtensa instruction set, thereby ensuring the availability of a robust ecosystem of third-party application software and development tools. All configurable, extensible Xtensa processors are always compatible with major operating systems, debug probes, and ICE solutions. In addition, they always come with an automatically generated, complete software development toolchain including an advanced integrated development environment based on the ECLIPSE framework, a world-class compiler, a cycle-accurate SystemC-compatible instruction set simulator, and the full industry-standard GNU toolchain.

Let's look at an example. If you don't know C, you can just skip over the code; the basic message will still come through.

Here's a function pop_count that counts how many bits are set in a word:

unsigned int pop_count (unsigned int x) {
   unsigned int y = 0;
   unsigned int k;
   /* test the low bit of x, count it, then shift right, 32 times */
   for (k = 0; k < 32; k++) { if ((x & 1) == 1) y++; x = x >> 1; }
   return y;
}

This takes at least 70 cycles to run. You can do better by writing better C. For example, a trick I learned over 30 years ago is to mix logical and arithmetic operations: AND a number with itself minus one and you remove the rightmost one-bit in the word. That reduces the cycle count to roughly twice the number of ones in the word, plus a little overhead.
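
Here is what that trick looks like in C (a quick sketch of mine, not from the whitepaper; the function name is just for illustration):

unsigned int pop_count_trick (unsigned int x) {
   unsigned int y = 0;
   while (x != 0) {
      x = x & (x - 1);  /* clears the rightmost one-bit */
      y++;
   }
   return y;
}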

But it is not complicated to count the number of bits that are one in a single cycle using just combinational gates. Here's how you use TIE to express that:

operation pop_count {out AR co, in AR ci} {} {
   wire [3:0] a0 = ci[0]  + ci[1]  + ci[2]  + ci[3]  + ci[4]  + ci[5]  + ci[6]  + ci[7];
   wire [3:0] a1 = ci[8]  + ci[9]  + ci[10] + ci[11] + ci[12] + ci[13] + ci[14] + ci[15];
   wire [3:0] a2 = ci[16] + ci[17] + ci[18] + ci[19] + ci[20] + ci[21] + ci[22] + ci[23];
   wire [3:0] a3 = ci[24] + ci[25] + ci[26] + ci[27] + ci[28] + ci[29] + ci[30] + ci[31];
   wire [5:0] sum = a0 + a1 + a2 + a3;
   assign co = {26'b0, sum};
}

This reduces pop_count to a single instruction that executes in a single cycle, roughly a 70X speedup over the original code.
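
Because the compiler knows about the new operation, it is available from C as an intrinsic you can call directly. Here is a minimal sketch, assuming the generated header for this TIE package is called xt_my_tie.h (the real header name depends on how the TIE package is named, and count_set_bits is just an illustrative wrapper):

#include <xtensa/tie/xt_my_tie.h>  /* assumed name of the auto-generated TIE header */

unsigned int count_set_bits (unsigned int word) {
   /* pop_count now compiles to the single custom instruction defined above */
   return pop_count (word);
}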

That is actually a very simple example. You can also add registers to the register file, or even another register file. You can add additional I/O interfaces—for example, to connect up to a custom RTL block implementing some "secret sauce" function. For a realistic example, see my post HOT CHIPS: Microsoft Hololens 2.
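
To give a flavor of what those extensions look like in the TIE language, here is a rough sketch of mine modeled on the pop_count operation above (it is not from the whitepaper, so treat the exact keywords and widths as assumptions): a 32-bit accumulator state plus a 32-bit input queue that external RTL can push data into.

// Assumed sketch: custom processor state plus a queue interface to outside hardware.
state ACC 32 add_read_write   // 32-bit accumulator; the tools also generate read/write instructions
queue IN_DATA 32 in           // 32-bit input queue fed by an external RTL block

// Pop one word from the queue and add it into the accumulator, in one instruction.
operation acc_add {} {inout ACC, in IN_DATA} {
   assign ACC = ACC + IN_DATA;
}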

More Details

For more detail, read the whitepaper Ten Reasons to Optimize a Processor. It covers the same ten reasons I gave above, but with a lot more motivation and detail.

A deeper dive into TIE is in the whitepaper TIE Language—The Fast Path to High-Performance Embedded SoC Processing. In addition to the pop_count example, there are examples showing instruction fusion, using Flexible Length Instruction eXtension (FLIX) to pack variable-length instructions, adding registers, and more.

If you are a software engineer, you might want to read the whitepaper Tensilica Software Development Toolkit (SDK).


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.