The last two days' posts have looked at the development of general-purpose processors, and why we are now in the age of special-purpose processors, including one particular type, the GPU. Today, I take a look at some of the other aspects of domain-specific computing.
VLIW stands for "very long instruction word". What that means is that each "instruction" executed actually consists of a number of operations bundled together (think six or eight, not hundreds). Instead of the hardware trying to find instruction-level parallelism on the fly, as I described yesterday, the onus is on the compiler to find it, pick which operations should be executed together, and take care of all the dependencies.
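To make the compiler's job concrete, here is a toy sketch of bundle packing: greedily filling each wide instruction with operations whose dependencies have already completed. This is an illustration, not a real scheduler; the six-slot width and the operation names are assumptions for the example.

```python
# Illustrative sketch (not a real compiler pass): greedily pack operations
# into VLIW bundles. deps maps an op to the set of ops it depends on; an
# op may only be scheduled once all its dependencies sit in EARLIER bundles.
def pack_bundles(ops, deps, width=6):
    scheduled = set()   # ops placed in bundles already issued
    bundles = []
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            if len(bundle) == width:
                break  # bundle is full; the rest wait for the next cycle
            # Ready only if every dependency is in an earlier bundle.
            if deps.get(op, set()) <= scheduled:
                bundle.append(op)
                remaining.remove(op)
        bundles.append(bundle)
        scheduled.update(bundle)
    return bundles

# a, b, c are independent; d needs a and b; e needs d.
bundles = pack_bundles(["a", "b", "c", "d", "e"],
                       {"d": {"a", "b"}, "e": {"d"}})
print(bundles)  # [['a', 'b', 'c'], ['d'], ['e']]
```

Three operations issue together in the first bundle, but the dependence chain forces two nearly empty bundles after it, which is exactly the "not enough parallelism to express explicitly" problem.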
It was thought that this would be a good approach for general-purpose processors, and Intel and HP in particular invested a lot of money in Itanium. The project was kicked off in 1989, and was an attempt to solve the same problem as the out-of-order approach I described yesterday: how to get more than one instruction per cycle. Itanium executed up to six instructions per cycle. The biggest problem wasn't building the hardware to do this; it was that the compilers turned out to be impossible to write. For this to be a good solution, most instruction words needed to be full, providing something for all six execution units to do. At the time, this was called EPIC, for Explicitly Parallel Instruction Computing. However, it turned out to be very hard to find enough parallelism to express explicitly.
Code density was a big issue, and it was very low with VLIW instruction words that were not filled. It was even a problem for RISC processors such as the early Arm cores. Both Wally Rhines (then still at TI) and old-timers at Arm have told me about a big meeting that took place at Nokia where code density turned out to be the big hurdle to Nokia switching to Arm. On the plane from Helsinki to London, the engineers worked out how to add a compact 16-bit instruction set alongside the existing 32-bit instruction set. This became Thumb, and the ARM7TDMI (the T is for Thumb) went on to be the standard processor for 2G mobile standards such as GSM. You can read that story in more detail in my post The Design that Made ARM, based on my interview with the project lead...Simon Segars. You might have heard of him now that he is Arm's CEO.
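The code-density arithmetic behind that decision is worth seeing. The encoding widths below (32-bit ARM vs 16-bit Thumb) are real; the routine size and the instruction-count overhead for Thumb are invented for illustration.

```python
# Back-of-the-envelope code-density comparison, in the spirit of the
# Thumb story. Encoding widths are real; the workload numbers are
# hypothetical.
ARM_BYTES, THUMB_BYTES = 4, 2

n_instructions = 10_000          # hypothetical routine, in ARM instructions
arm_size = n_instructions * ARM_BYTES

# Thumb typically needs somewhat more instructions for the same work;
# 25% more is an assumption here, not a measured figure.
thumb_size = int(n_instructions * 1.25) * THUMB_BYTES

print(arm_size, thumb_size)                        # 40000 vs 25000 bytes
print(f"{1 - thumb_size / arm_size:.0%} smaller")  # 38% smaller
```

Even paying an instruction-count penalty, halving the encoding width wins substantially on memory footprint, which is what mattered for a cost-sensitive phone.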
In parallel with the work on Itanium, computer architects developed all the infrastructure of modern processors that I described yesterday: out-of-order execution, branch prediction, and large cache hierarchies. One advantage of this approach over VLIW is that it is more dynamic. Discovering instruction-level parallelism on the fly works well even when branches are sometimes taken and sometimes not, for example, whereas static instruction packing across loops and conditionals turned out to be very hard. Branch prediction, loop unrolling, and handling cache misses were other challenges for VLIW compilers. Another competitive disadvantage was the code-density issue I just mentioned: any processor that has a cycle where an execution unit does nothing is wasting that resource, but a VLIW instruction set also wastes the space required to tell the resource to do nothing.
So out-of-order execution turned out to be a better approach than VLIW for general-purpose computing.
For special-purpose computing, such as digital signal processing (DSP) algorithms, vision processing, and neural net inference, however, VLIW is very effective.
I'm going to call special-purpose VLIW processors DSPs. In one sense, most of them process digital signals, but in another sense, they are unlike traditional DSPs. Since many of the latest processors are focused on aspects of deep learning, artificial intelligence, and neural networks, the digital signal processing name is a bit misleading. There probably should be another word, but there isn't one that has gained acceptance.
One change since the Itanium VLIW era has been more flexible instruction lengths. Instead of having a fixed number of slots, and requiring unused slots to be filled with no-ops, the instruction length can vary. This avoids the code bloat that early implementations of VLIW were famous for.
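A small calculation shows the effect. The slot encoding size, bundle width, and the schedule below are all invented numbers; only the mechanism (padding with no-ops versus encoding only the issued operations) is the point.

```python
# Sketch of the code-size effect of fixed- vs variable-length VLIW
# bundles. All numbers are hypothetical.
SLOT_BYTES = 4        # assumed encoding size of one operation slot
WIDTH = 6             # fixed bundle width, Itanium-style

ops_per_bundle = [6, 2, 1, 3, 1, 6, 2]   # an invented schedule

# Fixed-width bundles pad every unused slot with a no-op.
fixed_size = len(ops_per_bundle) * WIDTH * SLOT_BYTES

# Variable-length bundles encode only the operations actually issued.
variable_size = sum(ops_per_bundle) * SLOT_BYTES

print(fixed_size, variable_size)   # 168 vs 84 bytes
```

Half the bytes in the fixed-width encoding of this schedule are no-ops, which is the "code bloat" in question.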
The other big advantage is that DSP algorithms typically consist of lots of nested loops that can be statically scheduled. The compiler bundles multiple operations for parallel execution, and can do deep analysis at compile time to increase the amount of parallelism that can be achieved.
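A dot product, the heart of filters and neural-network layers alike, illustrates why. Every iteration does an independent multiply-accumulate and the trip count is known at compile time, so the schedule is just arithmetic. The 4-wide MAC bundle below is an assumed figure for illustration, not any particular DSP's width.

```python
# Toy illustration of why DSP kernels suit static scheduling: a dot
# product's inner loop repeats the same independent multiply-accumulate,
# so the compiler can bundle several per cycle with no run-time discovery.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

N = 1024                 # statically known trip count
MACS_PER_BUNDLE = 4      # hypothetical: four MAC slots per bundle

# With everything known at compile time, the schedule is arithmetic:
bundles_needed = N // MACS_PER_BUNDLE
print(bundles_needed)    # 256 bundles instead of 1024 single-issue cycles

print(dot([1, 2, 3], [4, 5, 6]))  # 32
```

Contrast this with branchy general-purpose code, where which operations are independent can only be discovered as the program runs.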
The result is a Cambrian explosion of processors for these applications, not just from Cadence, but from other IP companies and from many fabless semiconductor startups. Just at Hot Chips last year, a lot of the presentations were about processors to accelerate neural network training or inference. The best-known, perhaps, is Google's TPU, which you can read about in detail in my post Inside Google's TPU. Or you can read about the latest from Arm and NVIDIA in my post HOT CHIPS: Some HOT Deep Learning Processors. Yes, Cadence has one too—read my post The New Tensilica DNA 100 Deep Neural-Network Accelerator.
These blog posts have mostly been generic, about the trends that have made the move to domain-specific computing inevitable. One processor family that has been affected by these trends is Tensilica, both before and after its acquisition by Cadence.
Tensilica has several families of processors, such as the HiFi family for audio and the Vision family for imaging, each with different attributes.
Tensilica originally developed a processor called Xtensa that could be customized for different applications. However, it struggled to get adoption outside of specialized groups. The problem was that, for example, audio engineers wanted to just worry about the audio stuff, and not so much about processor architecture. They were happy to add a few specialized instructions when they could see a bottleneck they could break, but starting from scratch gave them more flexibility than they wanted. It was when Tensilica created base processors like the HiFi and Vision that they started to get real acceptance. These were already a good match for their domain and could be used unchanged. Plus, there was a wide variety of software such as codecs from partners that ran on them. If that wasn't enough, additional customization could be added.
I remember talking to another customizable processor company years ago, in my VaST days, and they had similar problems selling. The customer would say "what's the performance?" and the reply was "it depends what you build". The customer would then say "just tell me what performance a 32-bit processor will give me before I do anything fancy. Otherwise, I don't even know whether to bother looking at it."
You can still take Xtensa and roll your own, and sometimes people do. But most designs take one of these base cores and either use it unchanged or with some incremental customization. VLIW may not be a good solution for general-purpose processors, at least compared with the out-of-order approach that has been built up over the last couple of decades, but it is wonderful for this type of processing: compute-intensive workloads with a well-understood control structure and memory-access pattern.
So far, I've talked about special-purpose processors as if they are always on, and simply a better way to get peak performance than using a general-purpose processor. But there is another reason to use special-purpose processors, one that goes under the catchy name "dark silicon." A multicore processor may dissipate so much power that all the cores cannot be turned on simultaneously, at least for extended periods. Instead, some of the cores need to be kept dark and powered down (or dim, running at a lower frequency).
Since all the cores in a multicore processor are the same, there is no point in adding the last core if it can never be powered up, or, what comes to the same thing, if it can only be powered up when another of the identical cores is powered down. For thermal reasons, you can eke out some extra performance by moving workloads from one core to another to let the unused ones cool down, but the problem of dark silicon remains. In fact, the dark silicon may be cores that were never added to the design because they could not have been powered up, rather than cores that were added unwisely and can't be used.
However, with domain-specific cores, that is not true. Firstly, since they are domain specific, they typically use a lot less power and/or deliver more performance than a general-purpose core on the specialized workload for which they were designed; if that weren't the case, there would be no reason to create the core in the first place. But now the cores are not all the same. There may be no reason to add the last core to a multicore system if it is identical to all the others, but there can be good reasons to add specialized cores for specialized functions. An autonomous-driving chip might contain specialized vision processing, radar processing, neural network processing, and more, all of which are poor matches for general-purpose processors. Unlike with the multicore general-purpose processor, adding more specialized cores is a good idea even if they cannot all be powered up together: with a varied workload running on processors optimized for its various pieces, the overall performance will be higher even within a limited power envelope.
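The arithmetic of that argument can be sketched with made-up numbers: a 10 W chip budget, identical general-purpose cores at 2.5 W each, and a set of hypothetical accelerators that each do their own task faster for less power. Every figure below is invented for illustration.

```python
# Rough sketch of the dark-silicon argument, with invented numbers.
POWER_BUDGET_W = 10.0

# Homogeneous: identical 2.5 W cores, each delivering 1 unit of work.
cpu_power, cpu_perf = 2.5, 1.0
max_cpus_on = int(POWER_BUDGET_W // cpu_power)   # 4 -- a 5th core is pointless
homogeneous_perf = max_cpus_on * cpu_perf

# Heterogeneous: one CPU plus specialized cores, each far better on
# ITS OWN workload. Values are (power in W, performance on its task).
accelerators = {"vision": (1.5, 4.0), "radar": (1.0, 3.0), "nn": (2.0, 6.0)}
hetero_power = cpu_power + sum(p for p, _ in accelerators.values())
hetero_perf = cpu_perf + sum(perf for _, perf in accelerators.values())

assert hetero_power <= POWER_BUDGET_W            # still inside the envelope
print(homogeneous_perf, hetero_perf)             # 4.0 vs 14.0 units of work
```

Adding a fifth identical CPU buys nothing because it can never be powered up, but the mixed chip delivers several times the useful work within the same budget, precisely because its cores are different.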
In effect, dark silicon can be harnessed for domain-specific computing.
Fun true fact: there are a lot of glowing silicone bracelets for sale where "silicone" is misspelled "silicon". They are advertised as "glow in the dark silicon bracelets", and Google finds them all if you try to search for "dark silicon".
The view from 50,000 feet is simple: it is truly the age of domain-specific computing.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.