Author: Paul McLellan

Instruction Decoders: RISC vs CISC

15 Dec 2020 • 9 minute read

In my post The Start of the Arm Era, I said that it feels like something significant is changing. There's something Arm-y in the air. Suddenly Arm is faster than all x86 processors except the highest end of AMD's line. But why now? The three big announcements have been:

  • Amazon AWS's Graviton 2 (from its Annapurna subsidiary)
  • Arm's own Cortex-X1 (see my post Arm Goes for It)
  • Apple's M1 (also covered in Arm Goes for It)

I think that there are two reasons for this flurry of activity. One business. One technical.

The Business Reason

Arm's business, and still the central redoubt of its empire, is mobile. I've never seen a mobile phone in the last 20 years that wasn't powered by Arm. Smartphones, even top-of-the-line ones, don't want "MIPS at any cost" the way the data center does. Battery life is important, and the phone has to go in your pocket without burning you. Back when we were still allowed incandescent lightbulbs, I'm sure most of us had the experience of burning our fingers changing one. A hundred watts is a lot of power.

Until Arm got an R&D budget boost with Softbank's money, it could only produce one high-end (Cortex-A class) microarchitecture per year. So that had to be aimed at mobile, and everyone else had to make do. That meant that Arm's partners who were building chips for servers had to focus on building servers for the internet, not for computation. Serving thousands of users on the internet requires lots of threads, and it doesn't matter so much how fast each one runs. So the value proposition of an Arm server was something along the lines of "more cores, a tenth of the size, a tenth of the power, a tenth of the cost". For example, see my post How Arm Servers Can Take Over the World. Or Qualcomm and Arm Drink Their Own Champagne. But that value proposition didn't seem to be compelling. For example, Qualcomm shut down its Arm server program, as did HPE. If the cloud guys like AWS are all going to roll their own, then those were probably wise decisions.

So Arm designed its first microarchitecture aimed at data centers, the Neoverse N1. This was still not aimed at being competitive with x86: still a play with non-leading-edge single-thread performance, but high aggregate performance and much lower power. Annapurna/AWS had already built the first Graviton (in 16nm) and then used the N1 for Graviton 2 (in 7nm). Suddenly, it was clear that the Arm architecture could be competitive with x86 if that's what you aimed for.

Arm had its own high-end project, the Cortex-X1. We'll have to wait until one of Arm's partners actually reveals silicon on the design to know how it stacks up against the other high-end Arm chips, but Arm says that it is a no-holds-barred design to get the highest performance possible without sticking to the kind of power envelope of previous Cortex-A or even Neoverse designs.

Then Apple did the 5nm M1 chip in its first Arm-powered Macs, getting even higher performance. It is unclear how much of that is from clever design, and how much is because the M1 is in 5nm (compared to Graviton 2 in 7nm, and the Cortex-X1, which hasn't seen silicon). And, to be fair to x86-land, AMD's top of the line is still in 7nm, not 5nm. Intel's struggles to deliver its 10nm and 7nm processes are well-known (these Intel nodes are roughly equivalent to what everyone else calls 7nm and 5nm).

I expect to see other companies designing their own Arm-based silicon now that it is clear that Arm can exceed x86 performance with a fraction of the power. Plus you can't build an x86-based SoC since neither Intel nor AMD is going to license you a core.

The Technical Reason

Let's go back in time. You may not have heard of it, but the IBM 801 was the first RISC processor (although it was not a chip; this was still 1974). It was created by John Cocke, who won the Turing Award in 1987 as a result. It is a distant ancestor of the Power architecture. For this bit of history, the important part was the compiler. Since the 801 was a RISC, almost everything was register-to-register, except a load and a store instruction (plus some branch stuff, of course). The compiler was written with that assumption built in. As an experiment, the 801 PL/I compiler was retargeted to the IBM 360 instruction set (not RISC). But the compiler had the simple RISC architecture assumptions built in, so it couldn't make use of most of the IBM 360 instruction set. It just did everything in registers and used a single load and a single store instruction. You can probably guess where this is going. That compiler produced code that ran three times faster than code from the actual IBM 360 PL/I compiler, which made use of the whole instruction set.

The message of that experiment was that all those complex instructions don't actually help you that much with performance. On the other hand, in that era, they didn't seem to slow things down as long as you didn't use them.

In that era, memory and processors were about the same speed. The Interdata computers I used for my PhD research ran at about 1 MIPS, so between fetching instructions and fetching operands I guess that memory access time was about 0.7µs, or 700ns. Memory has sped up from 700ns for the old ferrite core memory to about 70ns for a modern DRAM, a factor of 10. But processors have sped up from 1MHz to 3000MHz, aka 3GHz, a factor of 3000. There is a huge mismatch between processor speeds and memory speeds. To make this clear, in my post Numbers Everyone Should Know, I describe the thought experiment of slowing the computer down so its clock rate is 1Hz. At peak, it executes an instruction every second. But if it has to access the main DRAM memory, that takes...6-7 minutes.
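To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python; the 3GHz clock and the ~130ns end-to-end DRAM latency are assumed round numbers chosen to land in the same ballpark as the thought experiment, not measurements of any particular machine.

```python
# Back-of-the-envelope version of the 1Hz thought experiment.
# Assumed round numbers, not measurements: a 3GHz clock, and an end-to-end
# main-memory latency of roughly 130ns for the whole trip to DRAM and back.
clock_hz = 3e9
dram_latency_s = 130e-9

cycles_per_access = dram_latency_s * clock_hz   # ~390 clock cycles per DRAM access
minutes_at_1hz = cycles_per_access / 60         # at 1Hz, one cycle takes one second

print(f"One DRAM access costs roughly {cycles_per_access:.0f} cycles")
print(f"Scaled to a 1Hz clock, that is about {minutes_at_1hz:.1f} minutes")
```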

Obviously, there is no way you could get current levels of performance from an architecture that loaded an instruction from memory, executed it, and then went on to the next one. In fact, for a long time, processors had been pipelined, perhaps loading one instruction, decoding the previous one, executing the one before that, and storing back to memory after that if necessary. Actually, pipelines can be a lot deeper than that. But that is not enough.
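As a toy illustration of that overlap, here is a minimal sketch of a classic four-stage pipeline; the stage names and the four placeholder instructions are invented for illustration, not a model of any real core.

```python
# Toy model of a 4-stage in-order pipeline: each cycle, every instruction in
# flight moves forward one stage, so up to four instructions overlap at once.
STAGES = ["fetch", "decode", "execute", "writeback"]
program = ["i0", "i1", "i2", "i3"]   # placeholder instruction names

for cycle in range(len(program) + len(STAGES) - 1):
    in_flight = []
    for stage_index, stage in enumerate(STAGES):
        instr_index = cycle - stage_index        # instruction occupying this stage
        if 0 <= instr_index < len(program):
            in_flight.append(f"{stage}:{program[instr_index]}")
    print(f"cycle {cycle}: " + "  ".join(in_flight))
```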

To get more performance, multiple instructions have to be executed at the same time. This can be a bit tricky since, obviously, you can't always execute instructions together: one instruction might need the result of another. I won't go into the details about all of that today. If you are interested, see my post How Do Out-of-Order Processors Work Anyway?

But in order to get performance by executing several instructions at once, you have to fetch several instructions at once, and decode several instructions at once.

So how do you do that?

I already showed you how slow DRAM access is, so one thing is to cache instructions so you don't need to go to DRAM. Typical hit rates are 95%+. But wherever the instructions are coming from, it takes about the same length of time to load a whole block of instructions as to load a single one. So all modern high-performance processors load a block of memory and then decode several instructions at once. From what I can tell, both Intel and AMD decode four instructions at once (for the top-of-the-line processors). The M1 decodes eight. The Cortex-X1 does five. I can't find a number for Graviton 2, but the Neoverse N1 on which it is based does four.
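As a quick aside before getting back to decode width: the standard average-memory-access-time formula shows why those 95%+ hit rates matter so much. The latencies below are assumed round numbers for illustration, not figures for any particular cache.

```python
# Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
# Assumed illustrative latencies: ~1ns for an instruction-cache hit,
# ~100ns for a miss that has to go all the way to DRAM.
hit_time_ns = 1.0
miss_penalty_ns = 100.0

for hit_rate in (0.90, 0.95, 0.99):
    amat = hit_time_ns + (1.0 - hit_rate) * miss_penalty_ns
    print(f"hit rate {hit_rate:.0%}: average fetch latency ~{amat:.1f}ns")
```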

So Intel and AMD need to up their game and decode eight instructions at once, right? The trouble is, they can't.

Arm instructions are all 32 bits (4 bytes). So it is straightforward to load a block of memory and have eight decoders attack it, one starting at byte 0, a second at byte 4, and so on up to the eighth starting at byte 28.
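Here is a toy sketch of why fixed-width decode parallelizes so easily: with 4-byte instructions, each of the eight decoders knows its starting offset before anything has been decoded. The 32-byte fetch block below is just dummy data.

```python
# Toy model of parallel decode for a fixed-width ISA: eight decoders each
# start at a statically known offset in a 32-byte fetch block.
INSTR_BYTES = 4
fetch_block = bytes(range(32))        # dummy stand-in for 32 bytes from the I-cache

offsets = [i * INSTR_BYTES for i in range(8)]            # 0, 4, 8, ..., 28
decoded = [fetch_block[off:off + INSTR_BYTES] for off in offsets]

for off, word in zip(offsets, decoded):
    print(f"decoder at byte {off:2d}: {word.hex()}")
```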

But x86 is a CISC architecture, and so the instructions vary in length from 1 to 15 bytes. So an x86 processor loads a block of memory, and the first decoder starts at byte 0. Where does the second start? The next instruction could start anywhere from byte 1 to byte 15, but until the previous instruction is decoded, it remains a mystery. The only solution is to be very wasteful and have, say, 32 decoders. You decode an instruction starting at every byte. Of course, most of these are not actually instructions and will need to be discarded. So in the first phase, you decode 32 instructions; then in the second phase, you know the instruction lengths and can decide which are the actual instructions and which are just garbage. If the average instruction length is 8 bytes, then those 32 decoders will deliver four instructions most of the time. This makes the decoder so complex that it is hard to add even one more instruction (to get to five), never mind four more to get to eight.
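Here is a toy sketch of that brute-force approach: guess a decode at every byte offset, then keep only the chain of decodes that actually lines up once the lengths are known. The random "instruction lengths" are invented for illustration; real x86 length decoding is far messier.

```python
# Toy model of speculative decode for a variable-length ISA.
# Phase 1: a decoder at every byte offset guesses that offset's instruction length.
# Phase 2: walk the block from byte 0, keeping only decodes that start exactly
# where the previous instruction ended; everything else is thrown away.
import random

random.seed(0)
BLOCK_SIZE = 32
guessed_length = [random.randint(1, 15) for _ in range(BLOCK_SIZE)]  # 1-15 bytes, like x86

kept = []
offset = 0
while offset < BLOCK_SIZE:
    kept.append((offset, guessed_length[offset]))
    offset += guessed_length[offset]

print(f"speculative decodes: {BLOCK_SIZE}, kept: {len(kept)}, discarded: {BLOCK_SIZE - len(kept)}")
for off, length in kept:
    print(f"real instruction at byte {off:2d}, length {length:2d}")
```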

Work done at DEC in the early 1980s found that, on the VAX, 20% of the instructions required 60% of the microcode but accounted for only 0.2% of the execution time. Those really complex instructions were rarely used. The problem for any CISC architecture, such as x86, is that even though these instructions are rare, the architecture has to support them.

For example, famously, the first instruction in alphabetical order for the x86 is the AAA instruction, which stands for "ASCII adjust after addition".

The AAA instruction is only useful when it follows an ADD instruction that adds (binary addition) two unpacked BCD values and stores a byte result in the AL register. The AAA instruction then adjusts the contents of the AL register to contain the correct 1-digit unpacked BCD result.

BCD is "binary-coded decimal", which used to be used when computers were really slow at multiplication and division, to avoid converting decimal numbers into binary and back again. It is "never" used anymore. But still, every x86 processor has to support it, including the decoder.
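For the curious, the fix-up AAA performs is small enough to sketch. This follows the behaviour described above, with the AL and AH registers and the auxiliary-carry flag modelled as plain Python values; it is a sketch of the adjustment, not a cycle-accurate model of the instruction.

```python
# Sketch of AAA ("ASCII adjust after addition"): after adding two unpacked BCD
# digits, pull the low nibble of AL back into the 0-9 range, carrying into AH.
def aaa(al, ah, af):
    if (al & 0x0F) > 9 or af:
        al = (al + 6) & 0xFF
        ah = (ah + 1) & 0xFF
        af = cf = True
    else:
        af = cf = False
    al &= 0x0F          # AL ends up holding a single unpacked BCD digit
    return al, ah, af, cf

# Example: 7 + 6 leaves AL = 0x0D after the ADD; AAA turns that into AH += 1
# and AL = 3, i.e. the unpacked BCD digits of the decimal result 13.
al, ah, af, cf = aaa(0x0D, 0x00, False)
print(f"AL={al:#04x} AH={ah:#04x} AF={af} CF={cf}")
```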

So unlike with the IBM 801, where we came in, those complex instructions really do slow you down, even if you never use them.

Summary

There's a lot more to a microarchitecture than decode width, of course. With all of these processors also caching at the micro-op level (don't worry if you don't know what that means), it's not clear just how important decode width is, but all the Arm processor teams decided it was important enough to go wider. It is a place that x86 cannot really follow.

At a high level, there are only two ways to improve a processor's (core's) performance:

  • Increase the clock rate
  • Increase the number of instructions executed per clock, known as the IPC (instructions per cycle)

For the last 15 years or so, it has not been possible to increase the clock rate by much due to power considerations. Dennard scaling ended long ago. All the microarchitectural tricks and caches are aimed at increasing the IPC, but there are not really any new tricks left, other than more of the same: bigger caches, wider instruction decode, bigger re-order buffers, more execution units, tweaking branch prediction. At the level of whole systems, you can also add more cores, but it is difficult to efficiently use a lot of cores with most programs.
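At bottom, those two levers multiply: performance is just clock rate times IPC, which is why a wider decoder (feeding a higher IPC) is where the action is when the clock can't go up. A trivial sketch, with illustrative numbers rather than benchmarks of any real chip:

```python
# Performance in (billions of) instructions per second = clock rate * IPC.
# Illustrative numbers only, not benchmarks of real chips.
def perf_gips(clock_ghz, ipc):
    return clock_ghz * ipc

print(f"3GHz at 4 IPC: {perf_gips(3.0, 4):.0f} billion instructions/s")
print(f"3GHz at 6 IPC: {perf_gips(3.0, 6):.0f} billion instructions/s")
print(f"4GHz at 4 IPC: {perf_gips(4.0, 4):.0f} billion instructions/s")
```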

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.

