Domain-Specific Computing 2: The End of the Dark Ages

14 Mar 2019 • 5 minute read

Yesterday, I looked at how general-purpose computer architecture changed during the period when Moore's Law and Dennard scaling were both working well, and computer architects were coming up with ever more innovative ways to get extra performance out of all those extra transistors. I called that period the dark ages, since there was no point in doing anything innovative in special-purpose architectures.

The End of the Dark Ages

Two things happened around the same time in the mid-2000s. First, Dennard scaling truly came to an end. Dennard scaling was the observation that you could move from one process generation to the next and increase the performance without increasing the power. That depended on most of the parasitic capacitance being gate capacitance. During the Dennard scaling era, clock rates could be increased at each process generation without the power going up unacceptably. But in the mid-2000s, at around 3GHz, that came to an end. Famously, Pat Gelsinger, then CTO of Intel, pointed out that future power densities were approaching those of a nuclear reactor core or the surface of the sun. That wasn't going to happen in a datacenter, let alone a smartphone in your pocket.
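To make the scaling argument concrete, here is a minimal sketch in Python. It assumes the textbook formulation of Dennard scaling (dimensions and supply voltage both shrink by a factor k, dynamic power is C × V² × f, leakage is ignored), which is an assumption layered on top of the description above rather than something taken from it. The function name and the normalized starting values are purely illustrative.

```python
def dennard_scale(k, capacitance=1.0, voltage=1.0, frequency=1.0, area=1.0):
    """Apply one idealized Dennard scaling step with linear shrink factor k >= 1.

    Assumes dynamic power P = C * V^2 * f and ignores leakage entirely.
    All inputs are normalized, not real process numbers.
    """
    c = capacitance / k      # smaller transistors have less capacitance
    v = voltage / k          # supply voltage scales down with the dimensions
    f = frequency * k        # shorter gates can switch faster
    a = area / (k * k)       # area shrinks in both dimensions
    power = c * v * v * f    # dynamic power per transistor
    return {"power": power, "area": a, "frequency": f,
            "power_density": power / a}

before = dennard_scale(1.0)
after = dennard_scale(1.4)   # roughly one process generation (illustrative)
print(before["power_density"], after["power_density"])
```

Under these idealized assumptions, the clock rate goes up by k while the power density stays where it was. If the voltage can no longer be reduced (replace `voltage / k` with `voltage`), the same arithmetic gives a power density that grows as k squared at each generation, which is the wall described above.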

From that point on, instead of pushing clock rates up, additional compute power was delivered in the form of more cores. If you look at Hennessy and Patterson's chart, you see that the slope of the graph changes at "end of Dennard scaling" to 23% per year, and then at "Amdahl's Law" to 12% per year. Gene Amdahl was another legendary computer architect, first at IBM, and then at the company that bore his name (and then Trilogy, but that's a story for another day). Amdahl's Law is an observation he made that in a parallel system like a multicore processor, the performance improvement is limited by the part that cannot be parallelized, the part that can't be run on more than one core. For example, if 90% of the code can be run on multiple cores, the absolute maximum speedup is 10X, since the other 10% still takes 10% of the original time no matter how many cores you add.
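The 90% example can be checked with a few lines of Python. This is a minimal sketch of the usual statement of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the fraction that can be parallelized and n is the number of cores. The function name and the core counts are just illustrative choices.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# 90% of the work parallelizes; try increasingly large core counts.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

However large the core count gets, the result creeps toward 10X but never reaches it, because the serial 10% is untouched by the extra cores.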

The second thing that happened was that computer architects had done everything they could think of for improving the performance of a general-purpose processor: pipelines, multilevel caches, out-of-order execution, branch prediction, multicore. Some tweaks could be made, but now we are in the era marked "end of the line" where performance is improving 2-3% per year. If you need more performance, you are not going to get it for free just by waiting, as happened in that earlier period. You can have more of everything: more general-purpose cores, more general-purpose processors. What you can't have is more performance for a single task running on a general-purpose processor. It is the end of instruction-level parallelism, in the sense that we don't know how to get more than we have achieved to date.
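One way to feel the difference between those annual rates is to turn them into doubling times. The sketch below assumes simple compound growth and takes the upper end (3%) of the "2-3% per year" figure. The function name is illustrative.

```python
import math

def doubling_time_years(annual_improvement):
    """Years for performance to double at a compound annual rate (0.23 = 23%)."""
    return math.log(2) / math.log(1.0 + annual_improvement)

for label, rate in (("end of Dennard scaling", 0.23),
                    ("Amdahl's Law", 0.12),
                    ("end of the line", 0.03)):
    print(f"{label}: {doubling_time_years(rate):.1f} years to double")
```

That works out to a little over 3 years, about 6 years, and more than 23 years respectively, which is why "just wait for the next processor" stopped being a strategy.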

The New Golden Age of Computer Architecture

If you can't get more performance on a general-purpose processor, then the answer is to use special-purpose processors. This is why it is now the golden age of computer architecture, since novel and innovative architectures matter, and won't just get surpassed a few years later by the improvements in general-purpose computing.

We are now in the era of domain-specific computing.

If we step back and look at the big picture, a general-purpose processor starts to look even less like the right solution in all circumstances. Many algorithms are inherently parallel. In most programming languages, such as C++ or Python, the programmer has to order the operations of the algorithm into a sequential flow (or, at best, very coarse parallelism with a small number of threads/processes). The compiler then makes really detailed decisions as to exactly what instructions, in what order, should operate on which registers. The processor then takes these instructions, highly optimized for sequential execution, and tries to find parallelism. It overrides the decisions made by the compiler as to instruction order and reorders them. It overrides the decisions made by the compiler about registers and renames them. It guesses which memory accesses will be required using caches, which work most of the time. It guesses which way branches will go. And then, perhaps the most amazing part of all, it does all this in a way that the instructions all seem to execute in the exact order that the programmer and compiler specified in the first place.

But for specialized algorithms with well-understood parallelism, throwing all that knowledge away and then trying to recover it on-the-fly is not optimal. This is especially true when the basic operations are much larger than register adds and multiplies: whole matrix operations, for example, or neural network inference (which is pretty much matrix operations anyway). Specialized hardware units can perform those operations a lot more efficiently than a general-purpose processor executing a sequence of hundreds of instructions.
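To see how quickly a single "matrix operation" turns into a long run of scalar work, here is a small Python sketch that multiplies two matrices the naive way and counts the multiply-accumulate steps. It is only an illustration of the operation count; the function name is hypothetical and nothing here models a real processor's instruction stream.

```python
def matmul_scalar(a, b):
    """Naive matrix multiply; returns the product and the number of MACs used."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]   # one multiply-accumulate
                macs += 1
    return result, macs

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, macs = matmul_scalar(a, b)
print(product, macs)
```

For two n-by-n matrices the count is n cubed, so even a 16-by-16 multiply is 4,096 multiply-accumulates before counting loads, stores, and loop overhead. A specialized unit can treat the whole multiply as one operation with a schedule that is known in advance.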

Specialized domains call for domain-specific computing.

GPUs

There are some algorithms that are known as "embarrassingly parallel". There is so much parallelism that any feasible number of cores can be taken advantage of. The most obvious of these is graphics processing, where each pixel on the screen can be calculated almost independently. More recently, a lot of neural network training algorithms have the same characteristic, since they involve huge numbers of largely independent multiply-accumulate (MAC) operations. I have heard, but it is way beyond my area of competence, that some molecular-level drug discovery and protein folding problems are the same.
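A toy version of the pixel case shows what "embarrassingly parallel" means in practice. In this sketch the `shade` function is a made-up stand-in for a real per-pixel calculation, and the tiny screen size is arbitrary. The point is only that each pixel depends on nothing but its own coordinates, so the work can be handed to any number of workers and the answer does not change.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4   # tiny illustrative "screen"

def shade(pixel):
    """Hypothetical per-pixel calculation: depends only on its own coordinates."""
    x, y = pixel
    return (x * 31 + y * 17) % 256

pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# Sequential version: one pixel after another.
sequential = [shade(p) for p in pixels]

# Parallel version: same function, pixels handed out to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(shade, pixels))

print(sequential == parallel)
```

Python threads will not make this particular example faster, so treat it as a picture of the independence rather than a benchmark. On hardware with thousands of simple cores, that same independence is what lets every core stay busy.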

For these algorithms, that means that if we can put more cores on a chip then we can use them. Obviously, we can't just put more general-purpose cores on a chip since there isn't room, or we'd have done it already. However, we don't need general-purpose cores if we are only going to do graphics or neural network training. Mostly, these just require large numbers of matrix operations. If we designed a core that could just do those matrix operations, then it would be a lot smaller than a general-purpose core. Further, since the scheduling of operations is pretty much known in advance, caches don't really buy us anything. They are there to smooth out access to DRAM when we have no idea in what order the DRAM is going to be accessed. If we know the order in advance, we can simply load the values from the right addresses ahead of time.

That is what a GPU is. It is a hugely multicore processor, where each core can only do the basic operations required and so is a lot smaller than a general-purpose core. But there are a lot of them. NVIDIA and AMD GPUs may have hundreds of cores. Some gaming systems have thousands, although I think they involve more than one chip. There is memory in each core, but no need for an overall cache since the workload is pumped into the system in the order required.

So a GPU is one type of domain-specific computing. Tomorrow, we'll look at some more.
