Spectre and Meltdown: An Update

17 Jan 2018 • 10 minute read

I wrote an extra Breakfast Bytes post a couple of weeks ago about Spectre and Meltdown, which had suddenly become public a week earlier than apparently was planned (having been disclosed to the relevant companies months before). I wrote that post only a few hours after information became available, because I thought it was important to get information out. After all, it turned out that the cause of the problem was optimization going on inside chips, which is clearly somewhere in the heart of the semiconductor ecosystem. I wrote mostly about Meltdown, since at that point there was more information available about it.

The scary thing about these two exploits is that they are not a minor design error in a particular chip. They are attacks against a particular way of designing high-performance processors and, as a result, they potentially affect all high-performance processors designed in the last couple of decades. Technically, they might affect any processor that uses speculative execution and out-of-order execution. Lower-performance processors are not affected, and IoT edge devices are unaffected (unless you count smartphones and autonomous cars under that umbrella).

I don't know if it is because of the long lead time between discovery and public announcement, or the importance of these exploits, but they have their own logos, Spectre (on the left above) and Meltdown (on the right). I don't recall exploits having marketing that is advanced enough to have logos designed. Even well-publicised malware such as Stuxnet (presumed to have been developed by the US government and/or the Israeli government) has to make do with the name written in scary fonts.

How Do They Work?

If you know enough computer science, you can read the original paper Meltdown (which is the full title of the paper). You can also read the paper Spectre Attacks: Exploiting Speculative Execution. The Google Project Zero group that discovered the issues has a webpage with a simplified explanation, but you still need enough computer science to be able to read the code shown there. You probably need at least a basic grounding in what speculative execution is, and how a modern out-of-order processor is architected.

I'll try to give you a flavor of what is going on so you have some idea of how you can end up with a security vulnerability that is in almost every high-end processor designed in the last decade or so.

An Analogy

Let's say you are in the shipping department of a company. Almost every day, a certain team gives you a package to ship to the same address. To be more efficient, each day you type up the label in advance, ready for the package. If the package needs to be shipped as usual, you have gained yourself a little time. If the package has to go to a different address, you have wasted a label, so you throw it in the bin and type out the correct address. If the bad guys, who don't know where you normally ship the package, want to find out the address, then they can exploit your behavior by bringing a package to be shipped to a fake address. It doesn't matter where to. It is enough to cause you to throw the label you speculatively typed into the trash and type the fake address. They can then look in the trash and find the label you typed but didn't use. At some level, everything is good. No package got shipped to the wrong address. Most of the time, pre-typing the label saves a bit of time. But there is one little difference: something that was never used and was thrown away might still be exploited.

OK, that's not really all that realistic. But microprocessors also need to get work done and they also do things speculatively, with a mechanism for correcting things if they guess wrong. The main one is called branch prediction, where the processor will run ahead and evaluate instructions after a branch before it is clear whether the branch will be taken (for example, at the end of a loop). This is analogous to you typing the label, knowing that most of the time it will be right, but keeping an eye out for the less frequent event that it is wrong. In the same way, inside the processor, most of the time this works well (the loop is taken so it makes sense to get started on the next run through). But when the loop ends, that work will be wasted and it has to be thrown away quietly, just like you had to throw away the incorrect label.
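To make the loop example concrete, here is a minimal sketch of one textbook prediction scheme, a two-bit saturating counter. This is an assumption chosen for illustration: real processors use far more elaborate predictors, and the function name and starting state below are hypothetical.

```python
def simulate_two_bit_predictor(outcomes, state=2):
    """Count mispredictions for a sequence of branch outcomes.

    `state` is a saturating counter in 0..3; values 2 and 3 predict "taken".
    """
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredictions += 1  # speculative work would be thrown away here
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredictions


# A loop that runs 100 times: the branch back to the top is taken 99 times,
# then not taken once when the loop finally exits.
loop_branch = [True] * 99 + [False]
misses = simulate_two_bit_predictor(loop_branch)
print(f"{misses} misprediction(s) out of {len(loop_branch)} branches")
```

In this toy model the only wrong guess is the final exit from the loop, which is the "incorrect label" that has to be thrown away.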

Another place where this happens is when the operating system (Linux, Android, iOS, and so on) is entered from normal user code. There are a number of ways that this can happen, but one is if a normal program tries to access data it should not, such as parts of the operating system. This is not allowed, and the operating system gets control and either shuts down the program (your app "crashes" on your phone, for example) or, if the program has arranged in advance to handle such errors, the operating system notifies the program.

Another bit of processor design you need to know a little about is that most systems have memory caches. DRAM chips may sound fast, but compared to the processor, they are very slow. So on the processor chip, the most recently accessed DRAM data is stored in really fast memory, in the expectation that it will be needed again soon and so fetching it a second time from DRAM can be avoided.

Accessing memory after a cache-miss typically takes hundreds of cycles. This has two impacts. One is that the processor really wants to find something useful to do in those hundreds of cycles. If it knows which instructions will be executed next (and some other technical things are true that make it safe), then the processor will run ahead and execute them. Often, the processor is not sure which instructions will be executed next since there may be a decision to be taken that results in doing one thing or another. In this case, the processor can execute both alternatives and later work out which one to discard. Or, using a technique called branch prediction, it can take a really good guess (less than 1% failure in practice) which alternative will be chosen and just do that one.

The second impact of the "hundreds of cycles" is that the program can tell, due to the different timing, whether an item of data was in the cache or not (did it take hundreds of cycles or was it almost instantaneous?).
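The timing difference can be sketched with a toy model. This is a simulation rather than a measurement of real hardware: the cycle counts, the cache size, and the names `ToyCache` and `was_cached` are all assumptions made up for illustration.

```python
from collections import OrderedDict

HIT_CYCLES = 4      # assumed cost of a cache hit (illustrative only)
MISS_CYCLES = 200   # assumed cost of going out to DRAM (illustrative only)


class ToyCache:
    """A tiny least-recently-used cache that reports a simulated access cost."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, address):
        """Touch an address and return the simulated number of cycles taken."""
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return HIT_CYCLES
        self.lines[address] = True            # a miss fills the cache
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used
        return MISS_CYCLES


def was_cached(cache, address, threshold=50):
    """Infer from timing alone whether `address` was already in the cache.

    Note that the probe itself loads the address, just as a real timed read would.
    """
    return cache.access(address) < threshold


cache = ToyCache()
cache.access(0x1000)                 # first touch: slow, fills the cache
print(was_cached(cache, 0x1000))     # True: the second access is fast
print(was_cached(cache, 0x2000))     # False: never touched before, so slow
```

The program never asks the cache what it contains. It only looks at how long the access took, and that is enough.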

Meltdown

Here is how Meltdown works. First, the program tries to access a particular byte in the operating system data. The processor loads the byte into an internal register without completing the slower check as to whether this should be allowed. Eventually, the processor will notice that you shouldn't have been allowed to do this, the data will be discarded, and you will never see it (your app will "crash" or get told it was misbehaving). If that was all that happened, it would not be a security vulnerability.

But modern processors are even faster than that, and they execute the next instruction, too, before discovering that the byte from the operating system should not have been accessed. A byte can take one of 256 values, and the value loaded into the internal register can be used to access one of 256 different memory locations. Again, this data will eventually be discarded when the cops catch up and notice that the first instruction should never have been allowed. So you accessed a byte, you used the byte to access another item of data, and then you didn't get to see either the byte or the data. That's how the architecture is meant to behave.

But one tiny thing is different, like the discarded label in your trash can in my bad analogy. If the cache was cold (meaning here that none of the 256 different memory locations were in the cache before these two instructions) then now one of them is warm, in the cache, as a result of the discarded speculative execution to get the data that you never got to see. By seeing which of the 256 locations is now warm, the program can work out what that byte of operating system data contained even though the processor correctly never revealed it directly. The program has found out the content of a single byte in the kernel space where the operating system lives.
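The three steps (forbidden read, dependent access, timing the 256 locations) can be put together in a toy simulation. This is a sketch of the simplified description above, not working exploit code, and it runs entirely inside an ordinary Python program. `ToyMachine`, the cycle counts, and the 4096-byte spacing between the 256 locations are assumptions for illustration.

```python
HIT_CYCLES = 4      # assumed cost of a cache hit (illustrative only)
MISS_CYCLES = 200   # assumed cost of a cache miss (illustrative only)
STRIDE = 4096       # assumed spacing so each of the 256 locations caches separately


class ToyMachine:
    """Hypothetical processor model with protected kernel memory and a cache."""

    def __init__(self, kernel_memory):
        self._kernel = kernel_memory   # user code is not allowed to read this
        self._cached = set()

    def flush_cache(self):
        self._cached.clear()

    def timed_read(self, address):
        """A legal user read: returns how many simulated cycles it took."""
        cycles = HIT_CYCLES if address in self._cached else MISS_CYCLES
        self._cached.add(address)
        return cycles

    def user_read_kernel(self, kernel_address, probe_base):
        # Speculative part: runs before the permission check has completed.
        secret = self._kernel[kernel_address]
        self._cached.add(probe_base + secret * STRIDE)   # warms one location
        # The check catches up: results are discarded and the program is told off.
        raise PermissionError("user code may not read kernel memory")


def leak_byte(machine, kernel_address, probe_base=0x100000):
    machine.flush_cache()                      # start with a cold cache
    try:
        machine.user_read_kernel(kernel_address, probe_base)
    except PermissionError:
        pass                                   # the program survives the fault
    timings = [machine.timed_read(probe_base + v * STRIDE) for v in range(256)]
    return timings.index(min(timings))         # the one warm location


machine = ToyMachine(kernel_memory=b"secret key")
leaked = bytes(leak_byte(machine, i) for i in range(len(b"secret key")))
print(leaked)   # b'secret key'
```

At no point does `user_read_kernel` return the byte. The value is recovered purely from which of the 256 locations came back quickly.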

By running through all the bytes of the operating system, all the tables (including keys, keystrokes, and who knows what) can be read by the user program.

Meltdown doesn't work quite as I described since I over-simplified it. But it gives you the basic idea of what is going on when out-of-order execution interacts with caches in a way that creates a leaky side-channel. Also, note that this allows data to be read, but doesn't allow Meltdown to take control of the computer.

Spectre

Spectre is more complex than Meltdown and I won't attempt to explain it in detail. It is next to impossible to do so without showing code examples. It works by tricking a victim program into reading data it wouldn't normally access and then leaking that data to an attack program running on the same machine.
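For readers who do want a code example, here is a heavily simplified toy simulation loosely modelled on the bounds-check-bypass variant described in the Spectre paper. It is a sketch under stated assumptions and nothing like a real exploit: the one-bit predictor, the `ToyVictim` class, and the way the attacker inspects the cache directly (standing in for timing measurements) are all hypothetical simplifications.

```python
STRIDE = 4096  # assumed spacing between the 256 probe locations


class ToyVictim:
    """Hypothetical victim: a public array with a secret stored right after it."""

    def __init__(self, public, secret):
        self.memory = bytes(public) + bytes(secret)
        self.public_size = len(public)
        self.predict_in_bounds = False   # one-bit toy branch predictor
        self.cached = set()              # probe locations currently warm

    def lookup(self, index):
        """Bounds-checked read, as the victim's author intended it."""
        in_bounds = index < self.public_size
        if in_bounds or self.predict_in_bounds:
            # Runs for real when in bounds, or speculatively when the
            # predictor wrongly guesses that the bounds check will pass.
            self.cached.add(self.memory[index] * STRIDE)
        self.predict_in_bounds = in_bounds   # predictor learns the real outcome
        return self.memory[index] if in_bounds else None


def leak_byte(victim, offset):
    victim.lookup(0)           # train the predictor with a legal request
    victim.cached.clear()      # flush, so only the next access leaves a trace
    refused = victim.lookup(victim.public_size + offset)
    assert refused is None     # the victim correctly refuses the request
    # Stand-in for timing all 256 probe locations: find the warm one.
    return next(v for v in range(256) if v * STRIDE in victim.cached)


victim = ToyVictim(public=b"hello", secret=b"hunter2")
stolen = bytes(leak_byte(victim, i) for i in range(7))
print(stolen)   # b'hunter2'
```

The victim's bounds check works exactly as written, and every out-of-bounds request is refused. The leak comes from the work done speculatively before the check was resolved.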

In the cloud, on something like AWS, many programs are typically running on any given server and so potentially an attack program can read some of the data of some of the other programs. The one bit of good news is that exploiting this to do anything malicious looks very hard, if not impossible. You can read data from whatever program happens to be running, but you can't really control whether you get anything valuable.

Mitigation

Since this is not a software error, it can't simply be patched. The processor and the operating system are both behaving correctly. The data is being deduced from an artifact of correct behavior, known in the security world as a side-channel. One side-channel that chip designers should be aware of is differential power analysis, which works out what is going on in a chip by analyzing its power consumption. By the way, that was discovered by Paul Kocher, who is on the author list for the Spectre paper (see above; Wikipedia says he is CEO of Cryptography Research without noticing that the company was acquired by Rambus and that he has since left). The side-channel here works out what is going on from timing differences that reveal whether data is in the cache or not.

The current fixes I have seen put the operating system data into a completely separate address space so that all the values that can be seen from a user program are zero. Previously, the operating system data was mapped into every program's address space but could not be accessed by user programs. With the fix, every system call, and every device interrupt, now involves a change of address space. This transition is relatively slow and the impact in practice is still being analyzed. Reported slowdowns seem to range from 5% to as much as 30%, although the top of that range is probably a pathological worst case. However, when your datacenter has 100,000 servers and they all get slowed down by 10%, that is the equivalent of 10,000 servers suddenly going dark.
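The datacenter arithmetic is easy to check, and it is worth noting that replacing the lost capacity takes slightly more machines than were "lost", because the replacements are slowed down too. The sketch below assumes throughput scales linearly with server count, and the function names are hypothetical.

```python
def capacity_lost(servers, slowdown_percent):
    """Servers' worth of throughput lost when every server slows down."""
    return servers * slowdown_percent / 100


def extra_servers_needed(servers, slowdown_percent):
    """Extra (equally slowed) servers needed to restore the original throughput."""
    return servers * 100 / (100 - slowdown_percent) - servers


servers = 100_000
for percent in (5, 10, 30):
    lost = capacity_lost(servers, percent)
    extra = extra_servers_needed(servers, percent)
    print(f"{percent}% slowdown: {lost:,.0f} servers' worth lost, "
          f"about {extra:,.0f} extra servers to make it up")
```

For the 10% case this gives the 10,000 servers mentioned above, and roughly 11,000 additional machines to get back to where you started.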

What to Do

The bug was discovered (independently by two teams) and reported to the processor manufacturers in June of last year (2017). The plan was to have a coordinated announcement of the issues along with fixes on January 8, but somehow it started to leak out and everything was announced on January 2 and 3.

Assuming you are just a user (you don't work in the microprocessor group at Intel, or the cloud group at Amazon, or something similar) then all you need to do is what you should do anyway: keep your smartphone and your PC up to date with the latest patches. Meltdown is the easier of the two to fix. Spectre doesn't seem to have a fix, but the vulnerability is so difficult to exploit that it is unclear whether anyone short of a nation-state three-letter agency would have a chance (and against that kind of threat you are pretty much powerless). For Linux, there is a set of patches that go under the name KAISER that address Meltdown (but not Spectre). See KAISER: Hiding the Kernel from User Space.
