What is Meltdown? How Can It Affect Both Intel and Arm?

3 Jan 2018 • 8 minute read

If you pay attention to anything to do with processors, security, or even investment discussion sites covering companies like Intel, you may be aware that 2018 has started with the discovery of a major security flaw that affects high-end microprocessors from Intel (desktop, laptop, and servers), Arm (smartphones and servers), and perhaps other manufacturers (the situation with AMD, in particular, is a little unclear as I write this). This is not a software bug. It is not even a bug in the x86 or Arm architectures. It is a bug in the normal way that high-end microprocessors from most (all?) manufacturers are implemented. That is how it can affect a wide range of processors from several different companies, all designed by separate teams.

I decided this is sufficiently important to write an extra post today to try and give you a little bit of an explanation of what is going on. It impacts software, microprocessor design, memory subsystems, and more. It only affects processors that run code that the user can change, such as the x86 processor in your PC or Mac, or the Arm processor in your phone. It doesn't affect processors that only run code created as part of the SoC design, in particular, Tensilica processors in your devices (I don't think that they are vulnerable anyway, since they use a VLIW architecture, which takes a different approach). It also doesn't affect lower performance processors, such as microcontrollers or Arm Cortex-M processors, so IoT devices are all fine.

If you know enough computer science, you can read the original paper (Meltdown, which is the full title of the paper) or the still very technical simplified explanation. I'll try and give you a flavor of what is going on so you have some idea of how you can end up with a security vulnerability that is in almost every high-end processor designed in the last decade or so.

An Analogy

Let's say you are in the shipping department of a company. Almost every day, a certain team gives you a package to ship to the same address. To be more efficient, each day you type up the label in advance, ready for the package. If the package needs to be shipped as usual, you have gained yourself a little time. If the package has to go to a different address, you have wasted a label, so you throw it in the bin and type out the correct address. If the bad guys, who don't know where you normally ship the package, want to find out the address, then they can exploit your behavior by bringing a package to be shipped to a fake address. It doesn't matter where to; it is enough to cause you to throw the label you speculatively typed into the trash and type the fake address. They can then look in the trash and find the label you typed but didn't use. At some level, everything is good. No package got shipped to the wrong address. Most of the time, pre-typing the label saves a bit of time. But there is one little difference: something that was never used was thrown away, and it might be exploitable.

OK, that's not really all that realistic. But microprocessors also need to get work done and they also do things speculatively, with a mechanism for correcting things if they guess wrong. The main one is called branch prediction, where the processor will run ahead and evaluate instructions after a branch before it is clear whether the branch will be taken (for example, at the end of a loop). This is analogous to you typing the label, knowing that most of the time it will be right, but keeping an eye out for the less frequent event that it is wrong. In the same way, inside the processor, most of the time this works well (the loop is taken so it makes sense to get started on the next run through). But when the loop ends, that work will be wasted and it has to be thrown away quietly, just like you had to throw away the incorrect label.
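To make the loop example a little more concrete, here is a minimal sketch of a branch predictor. It assumes a textbook 2-bit saturating counter, which is my choice for illustration and not a description of what any particular Intel or Arm core does. The function name simulate_loop_branch is made up for this example.

```python
def simulate_loop_branch(iterations, runs=1):
    """Count mispredictions for a loop's backward branch.

    Assumption: a textbook 2-bit saturating counter. Real predictors
    are far more elaborate; this only illustrates the idea.
    """
    counter = 2  # states 0-1 predict "not taken", 2-3 predict "taken"
    mispredictions = 0
    total = 0
    for _ in range(runs):
        for i in range(iterations):
            # The branch jumps back to the top except on the final pass.
            taken = i < iterations - 1
            predicted_taken = counter >= 2
            if predicted_taken != taken:
                # Work done speculatively past the branch is thrown away.
                mispredictions += 1
            # Nudge the counter toward the real outcome, saturating at 0 and 3.
            counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
            total += 1
    return mispredictions, total


if __name__ == "__main__":
    missed, total = simulate_loop_branch(iterations=100)
    print(f"{missed} mispredictions out of {total} branches")
```

In this toy model the only wrong guess is the one at the end of the loop, which is the wasted work that has to be quietly discarded.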

Another place where this happens is when the operating system (Linux, Android, iOS, and so on) is entered from normal user code. There are a number of ways that this can happen, but one is if a normal program tries to access data it should not, such as parts of the operating system. This is not allowed, and the operating system gets control and either shuts down the program (your app "crashes" on your phone, for example) or, if the program has arranged in advance to handle this situation, the operating system notifies the program.

Another bit of processor design you need to know a little about is that most systems have memory caches. DRAM chips may sound fast, but compared to the processor, they are very slow. So on the processor chip, the most recently accessed DRAM data is stored in really fast memory, in the expectation that it will be needed again soon and so fetching it a second time from DRAM can be avoided. I won't go into the details, and it is not straightforward, but a program can tell (through timing) if the data was in the fast cache memory, or if it had to be fetched from the slow DRAM.
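Here is a rough sketch of why timing gives the game away. It is a toy model, not a measurement of real hardware: the ToyCache class, its capacity, and the cycle counts are all numbers I made up for illustration.

```python
class ToyCache:
    """A tiny cache model that remembers the most recently used addresses."""

    HIT_CYCLES = 4     # assumed cost of a cache hit (illustrative only)
    MISS_CYCLES = 200  # assumed cost of going out to DRAM (illustrative only)

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = []  # least recently used address first

    def access(self, address):
        """Return the simulated cost, in cycles, of reading this address."""
        if address in self.lines:
            # Hit: move the address to the most recently used position.
            self.lines.remove(address)
            self.lines.append(address)
            return self.HIT_CYCLES
        # Miss: evict the least recently used line if the cache is full.
        if len(self.lines) >= self.capacity:
            self.lines.pop(0)
        self.lines.append(address)
        return self.MISS_CYCLES


if __name__ == "__main__":
    cache = ToyCache()
    print("first access: ", cache.access(0x1000), "cycles")
    print("second access:", cache.access(0x1000), "cycles")
```

The first access pays the DRAM price and the second does not, so a program that times its own loads can tell which addresses were touched recently.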

Meltdown

Here is how Meltdown works. First, the program tries to access a particular byte in the operating system data. The processor loads the byte into an internal register without doing the slower check as to whether this should be allowed. Eventually, the processor will notice that you shouldn't have been allowed to do this, the data will be discarded, and you will never see it (your app will "crash" or get told it was misbehaving). If that was all that happened, it would not be a security vulnerability.

But modern processors are even faster than that, and they execute the next instruction, too, before discovering that the byte from the operating system should not have been accessed. A byte can take one of 256 values, and the value loaded into the internal register can be used to access one of 256 different memory locations. Again, this data will eventually be discarded when the cops catch up and notice that the first instruction should never have been allowed. So you accessed a byte, you used the byte to access another item of data, and then you didn't get to see either the byte or the data. That's how the architecture is meant to behave.

But one tiny thing is different, like the discarded label in your trash can in my bad analogy. If the cache was cold (meaning here that none of the 256 different memory locations were in the cache before these two instructions) then now one of them is warm, in the cache, as a result of the discarded speculative execution to get the data that you never got to see. By seeing which of the 256 locations is now warm, the program can work out what that byte of operating system data contained even though the processor correctly never revealed it directly. By running through all the bytes of the operating system, all the tables (including keys, keystrokes, and who knows what) can be read by the user program.
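The sequence above can be sketched as a simulation. To be clear, this is not exploit code and it touches no real hardware: ToyCPU, leak_byte, leak_all, and the STRIDE spacing are all hypothetical names and values I invented, and the "cache" is just a Python set. It only models the logic of using a discarded access to warm one of 256 locations.

```python
STRIDE = 4096  # assumed spacing so each byte value gets its own cache line


class ToyCPU:
    """Toy model of a processor that speculates past a forbidden load."""

    def __init__(self, secret_memory):
        self._secret = secret_memory  # "operating system" data, off limits
        self.cached = set()           # addresses in the simulated cache

    def flush_cache(self):
        """Make the cache cold, so none of the 256 locations are present."""
        self.cached.clear()

    def forbidden_load_then_dependent_load(self, secret_index):
        """Model the two instructions: forbidden load, then dependent load."""
        value = self._secret[secret_index]  # happens speculatively
        self.cached.add(value * STRIDE)     # dependent access warms one line
        # The permission check catches up and the results are discarded,
        # but the warmed cache line stays behind.
        raise PermissionError("access to operating system data denied")

    def is_fast(self, address):
        """Stand-in for timing a load: True means it was in the cache."""
        return address in self.cached


def leak_byte(cpu, secret_index):
    """Recover one secret byte by seeing which of 256 locations is warm."""
    cpu.flush_cache()
    try:
        cpu.forbidden_load_then_dependent_load(secret_index)
    except PermissionError:
        pass  # the program is told it misbehaved and carries on
    for guess in range(256):
        if cpu.is_fast(guess * STRIDE):
            return guess
    return None


def leak_all(cpu, length):
    """Run through the secret one byte at a time."""
    return bytes(leak_byte(cpu, i) for i in range(length))


if __name__ == "__main__":
    cpu = ToyCPU(b"hunter2")
    print(leak_all(cpu, 7))
```

Note that leak_byte never sees the value returned by the forbidden load. Everything it learns comes from which location is warm afterwards.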

This is just one of three vulnerabilities, and it doesn't work quite as I described since I (over-)simplified it. But it gives you the basic idea of what is going on when out-of-order execution interacts with caches in a way that creates a leaky side-channel.

Mitigation

Since this is not a software error, it can't simply be patched. The processor and the operating system are all behaving correctly. The data is being deduced from an artefact of correct behavior, known in the security world as a side-channel. One side-channel that chip designers should be aware of is differential power analysis, working out what is going on in a chip by analyzing its power consumption. By the way, that was discovered by Paul Kocher, who is on the author list for the paper (see above, and BTW Wikipedia says he is CEO of Cryptography Research without noticing that they were acquired by Rambus and he has since left). The side-channel here works out what is going on from timing differences that depend on whether data is in the cache or not.

The current fixes I have seen are to put the operating system data into a completely separate address space so that all the values that can be seen from a user program are zero. But that means that every system call, and every device interrupt, will now involve a change of address space. Previously, the operating system data was part of every program, just that it could not be accessed by user programs. This transition is relatively slow and the impact in practice is still being analyzed. Reported slowdowns seem to range from 5% to as much as 30%, although the high end is probably a pathological worst case. However, when your datacenter has 100,000 servers and they all get slowed down by 10%, that is the equivalent of 10,000 servers suddenly going dark.
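The datacenter arithmetic is simple enough to sketch. The function names here are my own, and the second calculation is an extra step I am adding: if you wanted to buy back the lost throughput with more (equally slowed) servers, you would need slightly more than the number that "went dark".

```python
def capacity_lost(servers, slowdown):
    """Servers' worth of throughput lost when every machine slows down."""
    return servers * slowdown


def extra_servers_needed(servers, slowdown):
    """Extra slowed-down servers needed to restore the original throughput."""
    return servers / (1 - slowdown) - servers


if __name__ == "__main__":
    servers, slowdown = 100_000, 0.10
    print(round(capacity_lost(servers, slowdown)), "servers' worth lost")
    print(round(extra_servers_needed(servers, slowdown)), "extra servers to compensate")
```

With 100,000 servers and a 10% slowdown, the loss is 10,000 servers' worth, and making it up takes roughly 11,000 additional machines because each new one is slowed down too.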

What to Do

This is a developing story. Yesterday, January 2, the existence of the problem was made public, but without any details. Today, January 3, the details were made public. As the tabloids like to say, "it is worse than we thought."

The bug was found some time ago and reported to Intel (who have paid a bounty for its discovery), Arm, AMD, Microsoft, Apple, Google, and others, to give time for fixes. Linux has (maybe) been fixed but with all the comments removed to make it hard for anyone to deduce the vulnerability from the fix. Android, macOS, and iOS have apparently been fixed (if you don't have the latest version of the OS on your phone, then now would be a good time to update). I'm not sure about Windows. There is talk of a fix being pushed out in the next Patch Tuesday, which I believe is next week.
