
Paul McLellan

Spectre/Meltdown & What It Means for Future Design 1

11 Sep 2018 • 8 minute read

At HOT CHIPS, one of the "keynotes" was actually a panel of what I'll call industry luminaries. They were discussing the implications of vulnerabilities such as Spectre, Meltdown, and the recently announced Foreshadow. This is the most important discovery in computer architecture in the last twenty years or so, and it will affect how processors are designed forever. Later in the conference, for example, Intel presented their next-generation processor, Cascade Lake, and discussed some of the changes they have made as a result. Later in the panel session, Jon Masters said that Red Hat alone has spent over 10,000 hours on these issues.

I am going to cover the panel in detail. Obviously, it affects processor architects the most. But it affects anyone who uses processors, such as software engineers and SoC designers. Everyone needs to be aware of the implications of this. If all of this is more than you want to know, the one takeaway is that we don't know how to completely protect against this type of attack without reducing processor performance to a few percent (under 5%) of what it is today.

If you want more details on Spectre and Meltdown, then a good place to start would be my post Spectre and Meltdown: An Update. Or if you want to hear about it from one of the sources, then Paul Kocher: Differential Power Analysis and Spectre. Or if you want to hear about it from one of the panelists, try Spectre with a Red Hat and Spectre with a Red Hat 2.

An Introduction to Speculative Execution

You can take whole advanced Masters-level courses on computer architecture that cover this, so in a few paragraphs, this is going to be the most basic of introductions.

Moore's Law might be limping now, but over the last couple of decades, processor performance improved an enormous amount through a mixture of scaling and architectural innovation. For a decade it was improving at 45% per year. However, off-chip DRAM access did not speed up nearly as much. This meant that the processor could execute roughly 200 clock cycles' worth of work in the time it took to do a single DRAM access. The first solution was to add on-chip cache memory that was much faster. By keeping the frequently used instructions and data in the cache, those 200 cycles could be reduced to a lot fewer. Over time, we went to multi-level caches, with a mixture of small, very fast memories, and larger but not so fast memories. But for this introduction, we don't need to get into those details. We'll assume a fast on-chip cache, and slow off-chip DRAM. In round numbers, a cache access takes 0.5ns whereas a main memory access takes 100ns (hence the 200-cycle number). Most instructions and most data would come out of the fast cache, and so those 200-cycle delays were mostly avoided.

But not all of them. Processor architects realized that the processor could do useful work while it was waiting: often, many of the following instructions didn't depend on the value coming from memory, so the processor could get on and execute them anyway. This worked fine for every instruction except conditional branches. When the processor ran into a conditional branch, it could stop and wait for the values coming from memory to arrive, and then discover whether the branch would be taken or not. Alternatively, it could take a guess as to whether the branch would be taken, and carry on executing instructions that didn't depend on the values it was awaiting from DRAM. This is known as branch prediction. How it is implemented is beyond the scope of this little explanation, but you win a lot by just following the rule "assume every branch does what it did last time it was executed."
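
As a rough illustration of that "do what it did last time" rule, here is a minimal sketch of a one-bit, last-outcome branch predictor. The table size, the hash of the branch address, and the function names are my own, purely for illustration; real predictors are far more elaborate.

```c
#include <stdbool.h>
#include <stdint.h>

/* One entry per slot, indexed by a few bits of the branch instruction's address. */
#define TABLE_SIZE 1024
static bool last_outcome[TABLE_SIZE];   /* true = this branch was taken last time */

static unsigned slot(uint64_t branch_pc) {
    return (unsigned)((branch_pc >> 2) % TABLE_SIZE);   /* crude hash of the address */
}

/* Guess before the condition (perhaps still waiting on DRAM) is known. */
bool predict_taken(uint64_t branch_pc) {
    return last_outcome[slot(branch_pc)];
}

/* Once the branch actually resolves, remember what it did. */
void record_outcome(uint64_t branch_pc, bool taken) {
    last_outcome[slot(branch_pc)] = taken;
}
```

Even something this simple gets the branch at the bottom of a loop right on every iteration except the last one, which is why the "last time" rule wins so much.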

However, there was one big complication. What if the processor guessed wrong? That is why it is called speculative execution: the processor is guessing whether the branch will be taken, but doing it in a way that it can clean up after itself if it guessed wrong. After a conditional branch, the instructions are marked as dependent on the branch. If the processor eventually determines that it guessed correctly, then the instructions are retired and the processor moves on. If it turns out that the branch prediction was wrong, then all the instructions that were executed speculatively are squashed, and the processor backs up to the conditional branch and starts to execute down the correct path. To give you an idea how complex this can get, the most advanced processors might get over 200 instructions ahead, guessing that the branch at the end of a loop would be taken and running through the loop many times (before finally discovering that the loop actually ended many iterations ago, and having to sort out the mess).
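
To make the retire-or-squash bookkeeping concrete, here is a toy model, entirely my own simplification and not how any real out-of-order core is organized: speculative results are parked in a buffer and only become architecturally visible if the branch resolves the way the predictor guessed.

```c
#include <stdbool.h>

#define MAX_SPEC 200       /* roughly how far ahead an aggressive OoO core might run */
#define NUM_REGS 32

/* A speculatively produced result, parked until the branch resolves. */
struct spec_result {
    int  dest_reg;         /* 0..NUM_REGS-1 */
    long value;
};

static struct spec_result pending[MAX_SPEC];
static int  num_pending = 0;
static long arch_regs[NUM_REGS];   /* the "real" (architectural) register file */

/* Execute an instruction under speculation: record its result, don't commit it yet. */
void speculate(int dest_reg, long value) {
    if (num_pending < MAX_SPEC)
        pending[num_pending++] = (struct spec_result){ dest_reg, value };
}

/* The branch has resolved: either make every pending result visible (retire),
 * or throw them all away and restart down the correct path (squash). */
void resolve_branch(bool prediction_was_correct) {
    if (prediction_was_correct)
        for (int i = 0; i < num_pending; i++)
            arch_regs[pending[i].dest_reg] = pending[i].value;
    num_pending = 0;       /* on a squash, the results simply never become visible */
}
```

The squash puts the registers back exactly the way they were. What it does not undo, as the Meltdown walkthrough below shows, is the effect the speculative loads had on the cache.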

This is how all high-performance (so-called out-of-order, or OoO) processors have been designed for about the last 20 years. From the programmer's point of view, the processor is executing the program in the order written. The processor is built so that, whether branches are predicted correctly or not, the results are exactly as if the instructions had been executed in order, just as the programmer imagines. Just faster.

For 20 years, nobody saw any problem with any of this. But last year Spectre and Meltdown were discovered. People in the need-to-know groups who had to try to fix these problems knew about them last year. The rest of us found out in the first week of this year. For processor architects, it was not a Happy New Year.

Meltdown is far easier to explain (and fix), so I'll give you a simplified overview of how it works. Let's say you want to read a byte of operating system memory that you shouldn't be able to. You train the branch predictor so that it guesses wrong and speculatively executes the code I'm about to describe. You also select an area of memory that hasn't been touched recently, so none of it is in the cache. Then you do the following: read a byte from the operating system memory, and then use that byte to pick one of 256 locations in the selected area of memory and read the value from there (it doesn't matter what the value is). The processor will soon discover that it got the branch wrong and squash all of this.

But there is one tiny thing that is different. One of the 256 locations in that selected area of memory is now in the cache (hot), because we read its value. Even though the read was squashed, the cache line is still hot. Since accessing a value in the cache takes 0.5ns and fetching it from DRAM takes 100ns, it is not that hard to check the timing of all 256 locations, only one of which will not require 100ns. So we know what was in the byte, even though the read itself got squashed and, in a sense, "we never read it."
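
To see what the timing check might look like, here is a minimal sketch of just the measurement side, assuming an x86 machine and the GCC/Clang intrinsics _mm_clflush and __rdtscp. The probe array, the page-sized stride, and the cycle threshold are illustrative choices of mine, and I have deliberately left out the transient read of the operating system byte itself.

```c
#include <stdint.h>
#include <x86intrin.h>          /* _mm_clflush, __rdtscp */

#define STRIDE 4096             /* one slot per page so slots don't share cache lines */
static uint8_t probe[256 * STRIDE];

/* Before the attack: flush every slot so none of them starts out in the cache. */
void flush_probe(void) {
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * STRIDE]);
}

/* How long does one load take, in timestamp-counter ticks? */
static uint64_t time_access(volatile uint8_t *p) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;                                    /* the load being timed */
    return __rdtscp(&aux) - start;
}

/* After the squashed code has touched probe[secret * STRIDE], exactly one slot
 * is hot. Scanning all 256 slots for the fast one reveals the secret byte. */
int recover_byte(void) {
    int hot = -1;
    for (int i = 0; i < 256; i++)
        if (time_access(&probe[i * STRIDE]) < 100)   /* illustrative cache-hit threshold */
            hot = i;
    return hot;
}
```

The hot slot stands out because a cache hit takes around 0.5ns and a miss that has to go all the way to DRAM takes around 100ns, exactly the gap this whole post hinges on.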

As Paul Kocher said (in the post I linked to above):

These should have been found 15 years ago, not by me, in my spare time, since I quit my job and was at a loose end. This is going to be a decade-long slog.

The Problem

Before going any further, let me emphasize the problem here. This is not a hardware bug in a single processor from a single manufacturer (I'll count Arm as a manufacturer here, although technically they license their designs to the people who actually do the manufacturing). This is a fundamental problem of the way in which processors are designed. Embarrassingly for all the people who work in the area, this is a weakness that has been hiding in plain sight for 15 to 20 years without a single person noticing (well, maybe the NSA and equivalents, who knows?).

Even if you didn't understand my explanation of speculative execution, just take this one fact away. A cache memory access is 0.5ns, and a DRAM access is 100ns. Processor architects use every trick they can come up with to avoid DRAM access, and to find useful things to do during the long delays when they can't avoid it. If we took away these tricks, speculation and caches, then we would have a processor with under 5% of the performance of current processors. No smartphones, no cloud datacenters, and Windows 98 era laptops. Party like it's 1999 doesn't sound so good in the processor space.

To make things worse, this has arrived just as Moore's Law is running out of steam (and processors have hit the power wall too). So we don't even have a 2X factor that we could lose and then win back with the next node. General-purpose processors are simply not getting faster: we've run out of tricks on the architecture side, and out of Dennard scaling on the semiconductor side.

I'm getting ahead of the panel here, but one thing Mark Hill pointed out is that these vulnerabilities are not "bugs" in the sense that the processor does not meet its spec. These processors all met their spec. The problem is more fundamental still: the way we specify architectures is wrong, since a correct implementation of the spec is still vulnerable to these side-channel attacks.

In the aftermath of the discovery of Spectre and Meltdown, the immediate focus was on how to mitigate the problems with all the processors that were already out in the field. But the next step is to incorporate the knowledge of this type of attack into next-generation architectures. That was the focus of this keynote panel.

The Panel

There were four panelists at Hot Chips, chaired by Partha Ranganathan of Google. Each panelist gave a brief introduction, and then they got together as a panel and took questions from the audience.

  • John Hennessy, currently Chairman of Alphabet (Google), but one of the inventors of RISC (for which he just shared this year's Turing Award) and co-author of the standard texts on computer architecture (along with his co-Turing-Award-honoree Dave Patterson).
  • Paul Turner of Google. Google's Project Zero is one of the groups that discovered these vulnerabilities, and Paul was part of the group tasked with mitigation.
  • Jon Masters of Red Hat, the person responsible for fixing up Red Hat's Linux as well as is possible.
  • Mark Hill of the University of Wisconsin-Madison, also on sabbatical at Google.

Tomorrow

Having tempted you with those names, I'll tell you what they actually said tomorrow.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.