This is the era of domain-specific computing. Or, to use the words of Dave Patterson and John Hennessy from their Turing Award lecture, the "New Golden Age of Computer Architecture." Of course, to have a New Golden Age, we had to first have some Dark Ages. Paradoxically, these were caused by improvements in general-purpose computer architecture (and a hefty dose of Moore's Law).
For a couple of decades, we were in the era of happy scaling. Moore's Law delivered a new, higher-performance process every couple of years, with no increase in cost per transistor (cost per die went up, but largely because each die contained more transistors) and no increase in power dissipation despite the higher clock frequency. At the same time, improvements in computer architecture for general-purpose processors increased performance too, even before taking into account the gains in silicon performance. Add those two things together and we were in what I'm calling the dark ages of computer architecture.
You can see it in Hennessy and Patterson's graph above. From about 1985 to the early 2000s, CPU performance improved by over 50% per year.
So why would I call it the dark ages when performance was increasing so fast?
Because there wasn't much point in doing any other kind of computer architecture. Let's say it's 1995 and you've got a good idea for how to design a specialized processor for something. Since this is EDA, we can take the example of hardware emulation. You've got a great idea for a hardware emulator that you can build, and it will be 100X faster than Verilog simulation running on a server. Back then, that server would probably have been a Sun Microsystems workstation rather than a PC (with an Intel x86 processor under the hood). Anyway, it takes you two years to go from your great idea to a designed chip, a prototyped system, and ramped-up manufacturing. In those two years, the performance of the "competition", Verilog simulation running on a workstation, has increased by 50%...twice.
But the Verilog simulator software engineers haven't been sitting around doing nothing; they've been improving the algorithms independently of the performance gains from faster processors. I don't know what a good number is for that period, but let's just say 25% per year to keep the numbers simple. Over two years, that increases performance by about 1.5X.
So in two years, the performance of Verilog simulation on a workstation has increased by 2.25X due to silicon performance and another 1.5X due to improvements in the simulator itself. That multiplies out to about 3.4X; let's call it 4X.
That means your 100X advantage over Verilog is down to 25X. The EDA sales cycle is typically nine months; let's say a year. The customer then takes a year to work on their chip using your emulation system. That's another two years, and another 4X increase in performance for the software solution. Your 25X drops to 6X. Nice to have, but hardly compelling for an extremely expensive piece of kit. A couple more years and it is obsolete.
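The erosion of the emulator's advantage is just compound interest working against you. Here is a small sketch of the arithmetic, using the illustrative rates from the scenario above (50%/year from silicon, 25%/year from software; the post rounds the combined two-year gain of roughly 3.4X up to 4X, so its 25X and 6X figures come out slightly lower here):

```python
# Hypothetical scenario numbers from the text: a specialized emulator starts
# out 100X faster, while the software "competition" compounds 50%/year from
# faster silicon and an assumed 25%/year from better algorithms.
def remaining_advantage(initial_speedup, years,
                        silicon_gain=1.5, software_gain=1.25):
    """Speedup left after the software side compounds for `years` years."""
    return initial_speedup / ((silicon_gain * software_gain) ** years)

# After the 2-year design-and-ramp cycle, then after 2 more years of
# sales cycle plus customer use.
print(round(remaining_advantage(100, 2), 1))  # ~28X
print(round(remaining_advantage(100, 4), 1))  # ~8X
```

The exact rates are guesses, but the shape of the curve is the point: any fixed speedup shrinks geometrically against an exponentially improving baseline.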
This type of argument applies to almost every area where you might think about building a special chip or a special processor to do something specific. The "competition", just doing it in software and riding both Moore's Law and the work of the general-purpose computer architects at places like Sun, Intel, Arm, AMD, and others, is hard to beat. It is especially hard to beat on price, since all users have to do is buy new workstations every few years, which they will do anyway. By definition, the workstation is general purpose; users don't buy it just for simulation. But your emulator or other specialized hardware is, by definition, limited to whatever you designed it for, so it is a much less attractive investment.
This is not the place to go into all the gory details of how a modern general-purpose processor is constructed. The image shows the internal architecture of Intel's Haswell, a 2013-vintage CPU, to give you a flavor of how much I'm not including here. I'll just talk about three aspects that will become important:
Normally, instructions are executed one after the other, rather like reading the words in a book in order. That's what the programmer (and the compiler writer) assumes. But architects realized a long time ago that often more than one instruction can be executed at a time, provided that neither depends on the other. I think the first place this was used was the CDC 6600, designed by legendary computer architect Seymour Cray in 1964. (I said it was a long time ago.)
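The core check the hardware performs, greatly simplified, can be sketched in a few lines. This is a toy model, not any real microarchitecture: two instructions can go in the same cycle only if neither writes a register the other reads or writes.

```python
# Toy dependence check: each instruction is modeled as
# (destination_register, set_of_source_registers). Two instructions are
# independent (can execute together) if there is no read-after-write,
# write-after-read, or write-after-write conflict between them.
def independent(a, b):
    a_dst, a_srcs = a
    b_dst, b_srcs = b
    return (a_dst != b_dst and          # no write-after-write
            a_dst not in b_srcs and     # b doesn't read a's result
            b_dst not in a_srcs)        # a doesn't read b's result

add = ("r1", {"r2", "r3"})   # r1 = r2 + r3
mul = ("r4", {"r5", "r6"})   # r4 = r5 * r6
sub = ("r7", {"r1", "r2"})   # r7 = r1 - r2, needs the add's result

print(independent(add, mul))  # True: no registers in common
print(independent(add, sub))  # False: sub must wait for r1
```

Real out-of-order machines do this across dozens of instructions at once, with register renaming to remove the write-after-read and write-after-write cases, which is exactly why the bookkeeping hardware is so large.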
Modern state-of-the-art processors manage to execute over two instructions per clock cycle (for example, I wrote about the Samsung Galaxy S9's application processor, presented at Hot Chips last year). One important thing to realize is just how much machinery that Samsung AP needs to achieve this.
That's a lot going on to execute two instructions per cycle. Even without going through all the architect-speak, it is probably obvious that this takes a lot of chip area (cost) and consumes a lot of power. In fact, almost all the power a modern microprocessor uses is overhead; very little goes into actually executing the instructions themselves.
The second thing is cache memory. During the period when processor performance was increasing at 50% a year, DRAM performance was only doubling every decade. It has reached the point today where the processor can execute 200 instructions in the time required for a DRAM access (70-100ns). See my post Numbers Everyone Should Know for more about how long everything takes in a computer. The solution is to add cache memory: faster, more expensive memory on the same chip as the processor, implemented using either SRAM or embedded DRAM (eDRAM). It holds recently used values from DRAM on the basis that they are likely to be used again soon. When that guess proves correct, the access is much faster and the processor doesn't have to stall and wait for a real DRAM access to be performed.
Typically there are several levels of cache: first, a very small, very fast, very expensive cache close to the processor, called the level 1 cache, often written L1$; then between one and three more levels built with static RAM. Typical access times are 0.5ns for level 1 (almost the same speed as the processor), 3ns for level 2, and 28ns for level 3 (and remember, ~100ns for DRAM). A modern processor typically satisfies about 97% of accesses from the cache hierarchy.
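You can see why those hit rates matter with a back-of-the-envelope average access time, using the latencies quoted above. The per-level hit rates below are illustrative assumptions on my part, chosen so that about 97% of accesses are satisfied somewhere in the hierarchy:

```python
# Back-of-envelope average memory access time. Latencies are the figures
# from the text (0.5ns L1, 3ns L2, 28ns L3, ~100ns DRAM); the per-level
# hit rates are illustrative assumptions, not measurements.
def avg_access_ns(levels, dram_ns=100.0):
    """levels: list of (latency_ns, hit_rate); misses fall to the next level,
    and anything that misses every cache level goes to DRAM."""
    total, remaining = 0.0, 1.0
    for latency_ns, hit_rate in levels:
        total += remaining * hit_rate * latency_ns
        remaining *= (1.0 - hit_rate)
    return total + remaining * dram_ns

hierarchy = [(0.5, 0.90), (3.0, 0.60), (28.0, 0.25)]  # L1, L2, L3
print(round(avg_access_ns(hierarchy), 2))  # ~3.9ns average
```

With these assumed rates, only 3% of accesses reach DRAM, yet that 3% still contributes most of the average (3ns of the ~3.9ns). That is why architects keep spending die area on ever-larger caches: every point of miss rate at the last level costs a full DRAM access.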
As processors got faster and faster, and memory didn't, the only way to close the gap somewhat was bigger and bigger caches. Look at a die photograph of any modern processor and you can see that a lot of the chip is memory. Actually, these days, you'll mostly see the cores of a multi-core processor, along with a large part of the die taken up by the top-level cache shared among all the cores.
Here's a die photo of an IBM Power8 processor (2014 era). It actually has four levels of caches; those are the blocks down the sides. There is a level 3 and a level 2 cache in each core (outlined in just the top-left core). The level 1 cache is too small to show on a chip this big. I'm guessing at least 40% of the chip is cache (each core is 50% cache, as you can see). The point is that caches are expensive: 40% of the area is 40% of the cost. Memories are not the biggest power hogs, so cache will not be that much of the power budget, but it is not negligible either.
The third complication is branch prediction. It's not much use having all this out-of-order execution infrastructure if everything grinds to a halt every few instructions when there is a branch, and on average every fifth instruction is a branch. Obviously, if the branch is unconditional, the processor can just carry on executing instructions at the destination. But if it is a conditional branch, the processor has to guess which way it will go and carry on executing down the predicted path; this is known as speculative execution. Guessing whether the branch is taken requires another memory, the branch prediction buffer (sometimes called the branch history table). It turns out that most branches do the same thing this time as they did last time (going around a loop again, or checking for a rare event, are both common uses of branches). But adding more history to the branch prediction buffer can do better still, such as noticing a branch that is taken every other time. The latest predictors use neural network algorithms.
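The classic textbook version of "do the same as last time, but don't flip on a single surprise" is a 2-bit saturating counter per branch. Real predictors are enormously more elaborate, but this toy sketch shows the idea:

```python
# A 2-bit saturating-counter branch predictor (textbook scheme, not any
# particular CPU's). States 0-1 predict not-taken, 2-3 predict taken;
# it takes two wrong guesses in a row to flip the prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start in "weakly taken"

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate: never go below 0 or above 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 8 times, not taken once at loop exit, then a new
# run of the same loop.
p = TwoBitPredictor()
history = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in history:
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct, "of", len(history), "predicted correctly")  # 16 of 17
```

The single loop-exit misprediction doesn't flip the counter all the way, so the predictor is right again the moment the loop restarts. That is exactly the "most branches do the same thing as last time" observation above, made robust against one-off events.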
The big complication in branch prediction is not when the predictor is right, but when it guesses wrong. A lot of instructions will have been partially executed, hanging in limbo waiting to find out whether the predictor got it right. If it did, the instructions complete (they are "retired", in processor-speak). If the predictor was wrong, all that work needs to be discarded without trace, and then the correct branch needs to be executed. It is actually even more complicated, since there are other "branches" that can have this effect, such as dividing by zero or attempting to access a non-existent memory location. These generate exceptions, which cause the code sequence to move somewhere else and, in many ways, look like an incorrectly predicted branch.
The era of "easy" performance increase is over. It is the end of the dark ages.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.