I've been meaning to write a post on how out-of-order processors work, but one challenge is to make the diagrams that are necessary to make it clear. Well, Jon Masters of Red Hat gave the keynote on the second day of the Linley Spring Microprocessor Conference. He was really talking about mitigation of Spectre and Meltdown, which I'll cover another day. But the first half of his presentation was an introduction to how processors work under the hood. One of his bêtes noires is that software and hardware people don't talk. I think his talk was one he put together more for software people, who don't know how their processors work, than for the audience of a processor conference, who generally do. But he had nice diagrams. So I'm going to take advantage of them.
Also, note that an introduction to computer architecture in a computer science course might be a semester long. This is inevitably going to be very high level and an oversimplification.
When processors were first invented, and in some microcontrollers today, the architecture was very simple. The processor would fetch an instruction, decode it, and execute it. Then it would go and get the next instruction. An obvious improvement is to execute one instruction while decoding the next one and fetching the one after that. This is known as pipelining. More stages can be added to the pipeline. If the design is well done, then the more stages in the pipeline, the faster the overall clock can run and, at least in optimal circumstances, the faster the processor will run.
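The overlap is easiest to see laid out cycle by cycle. Here is a minimal sketch in Python (my own, not from Jon's talk) of an ideal stall-free three-stage pipeline; the stage names and instruction labels are just for illustration:

```python
# Sketch of an ideal 3-stage pipeline: instruction i enters stage s at
# cycle i+s, so fetch, decode, and execute overlap across instructions.
STAGES = ["fetch", "decode", "execute"]

def pipeline_schedule(instrs):
    """Return {cycle: {stage: instruction}} for a stall-free pipeline."""
    schedule = {}
    for i, instr in enumerate(instrs):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s, {})[stage] = instr
    return schedule

sched = pipeline_schedule(["I1", "I2", "I3", "I4", "I5"])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

Note that in cycle 2 all three stages are busy at once: I1 is executing while I2 decodes and I3 is fetched. Five instructions finish in seven cycles instead of fifteen.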
One issue is with conditional branches. When a conditional branch occurs, the instruction fetch can't just go and get the next instruction until the condition has been evaluated. There are two possible outcomes: the branch is not taken and drops through, or the branch is taken and the next instruction is the target of the branch.
Various approaches have been tried over the years, such as simply stalling the pipeline until the branch resolves, or always guessing that the branch will not be taken.
One optimization is to add a branch predictor, which can improve the guess considerably. It is surprisingly effective to just guess that every conditional branch will do whatever it did last time. But this is still only a prediction, and sometimes it will be wrong and the pipeline will need to be flushed.
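That "do whatever it did last time" scheme is what is usually called a one-bit predictor. A minimal sketch in Python (my own example, with invented outcomes) shows why it works so well on loops:

```python
# Sketch of a 1-bit branch predictor for a single branch: always predict
# whatever the branch did last time.
def simulate(outcomes, initial_guess=True):
    """outcomes: list of bools (True = taken). Return correct predictions."""
    prediction = initial_guess
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # next time, guess whatever just happened
    return correct

# A loop branch taken 9 times, then falling through on exit:
print(simulate([True] * 9 + [False]))  # 9 correct out of 10
```

For a loop that iterates many times, the predictor is wrong only when the loop exits, so accuracy approaches 100% as the trip count grows.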
This is how an in-order architecture works. The instructions are executed in the order that the compiler sequenced them. There may be a branch predictor. A couple more steps are added to the pipeline than I described, there are instruction and data caches to improve performance.
The problem with this type of architecture is that it stalls and does nothing when something is not ready. If a read from memory hits in the cache, no problem. But if it misses, the processor might spend dozens of cycles doing nothing, waiting for DRAM to deliver the goods. As processors got faster and faster at around 50% per year during the 1990s, while DRAM latency barely improved, the penalty for a cache miss got bigger and bigger, up to hundreds of cycles. There was a lot of pressure to find a way to do useful work while waiting for the memory access to complete.
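The cost is easy to quantify with the standard average-memory-access-time formula. The numbers below are illustrative assumptions, not measurements from the talk:

```python
# Average memory access time (AMAT): hit time plus the miss rate times
# the miss penalty. Numbers here are assumed for illustration.
def amat(hit_time, miss_penalty, miss_rate):
    return hit_time + miss_rate * miss_penalty

# With a 1-cycle hit and a 200-cycle miss penalty, even a 2% miss rate
# makes the average access 5x slower than a hit:
print(amat(1, 200, 0.02))  # 5.0 cycles
```

This is why a predictor or cache that is "only" wrong a few percent of the time can still dominate performance: the penalty per miss is enormous.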
Often, one instruction doesn't depend on the next one. A trivial case is if one instruction loads one register from memory, and the next loads another register from memory, then they can be done in either order, or together. However, if a following instruction adds those two registers together, then it has to wait for both of the previous instructions (the two loads) to complete.
In fact, the processor can go further. Even if a register is reused, but the old value in it is not used, then the register can be renamed, and operations can proceed anyway. So if after the add operation, those two registers are loaded with new values, and those values are added, there is no reason that those operations have to wait for the first add to complete. The old values are overwritten (and so can never be used again later in the code sequence).
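Renaming is simple to sketch: every write to an architectural register is given a fresh physical register, and reads pick up whichever physical register currently holds the value. This is my own illustrative Python, with an invented instruction format:

```python
# Sketch of register renaming: each write to an architectural register
# allocates a fresh physical register (p0, p1, ...), so reuses of r1/r2
# for unrelated values no longer create false dependencies.
from itertools import count

def rename(instructions):
    """instructions: list of (dest, src, ...). Return the renamed list."""
    phys = count()   # allocator for fresh physical register names
    mapping = {}     # architectural register -> current physical register
    renamed = []
    for dest, *srcs in instructions:
        new_srcs = [mapping.get(s, s) for s in srcs]  # read current names
        mapping[dest] = f"p{next(phys)}"              # a write gets a new one
        renamed.append((mapping[dest], *new_srcs))
    return renamed

program = [
    ("r1", "mem0"),        # load r1
    ("r2", "mem1"),        # load r2
    ("r3", "r1", "r2"),    # add: r3 = r1 + r2
    ("r1", "mem2"),        # load r1 again, old value never used again
    ("r2", "mem3"),        # load r2 again
    ("r4", "r1", "r2"),    # second add, independent of the first
]
print(rename(program))
```

After renaming, the second pair of loads writes p3 and p4, so the second add reads p3 and p4 and has no dependence at all on the first add's p0/p1/p2.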
By running ahead and loading up lots of instructions, they can be examined to see which ones are ready to execute. Registers can be renamed whenever values are overwritten. Those two loads can execute immediately. The add can execute once both loads are complete. By keeping a scoreboard of what is ready, then the processor will always do work, provided something is ready. With multiple execution units, lots of work can also be done in parallel.
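The scoreboard idea can be sketched as a loop that, each cycle, issues every instruction whose inputs are already produced. This is a deliberately simplified model (single-cycle latencies, unlimited execution units), not how any real scoreboard is built:

```python
# Sketch of scoreboard-style issue: each cycle, execute every pending
# instruction whose dependencies have all completed. Latency is assumed
# to be one cycle and execution units unlimited, for simplicity.
def schedule(instructions):
    """instructions: list of (name, set_of_dependency_names).
    Return {name: cycle in which it executes}."""
    done = set()
    cycle_of = {}
    cycle = 0
    pending = list(instructions)
    while pending:
        ready = [(n, d) for n, d in pending if d <= done]  # inputs available
        assert ready, "deadlock: circular dependency"
        for name, _ in ready:
            cycle_of[name] = cycle
        done |= {name for name, _ in ready}
        pending = [(n, d) for n, d in pending if n not in done]
        cycle += 1
    return cycle_of

prog = [
    ("load1", set()),
    ("load2", set()),
    ("add", {"load1", "load2"}),
]
print(schedule(prog))  # both loads in cycle 0, the add in cycle 1
```

The two independent loads issue together in cycle 0, and the add issues as soon as both have completed.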
The register renaming is done in what is called the ROB, for re-order buffer. Lots of instructions are fetched and decoded, they are put into the ROB, and then when everything they need to execute is available, they leave the ROB and are executed. This is not actually a new idea: the basic scoreboarding approach was used in the CDC 6600 in the 1960s (designed by legendary computer architect Seymour Cray), and register renaming itself goes back to Tomasulo's algorithm in the IBM System/360 Model 91.
That works great until there are conditional branches. Once a conditional branch is predicted, then lots of instructions are fetched at the predicted address. They can be executed, but the results need to be marked as speculative. If the branch prediction turns out to be wrong, then all the results will need to be discarded. If the branch prediction turns out to be correct, then those results marked as speculative can be unmarked ("retired").
This is shown in the table above. The code is on the left, with the renamed registers on the right. The second-to-last column shows whether the instruction can execute, and the last column shows whether it is speculative, meaning that eventually either all those instructions will be unwound, or they will be retired in the normal way.
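The retire-or-unwind decision can be sketched in a few lines. The ROB representation here is invented for illustration; real hardware tracks considerably more state:

```python
# Sketch of resolving a predicted branch: results tagged speculative are
# kept (retired) if the prediction was right, discarded if it was wrong.
def resolve(rob, prediction_correct):
    """rob: list of (instruction, is_speculative), in program order.
    Return the instructions whose results are kept."""
    if prediction_correct:
        return [instr for instr, _ in rob]               # retire everything
    return [instr for instr, spec in rob if not spec]    # flush speculation

rob = [("load", False), ("branch", False), ("add", True), ("store", True)]
print(resolve(rob, True))   # all four instructions retire
print(resolve(rob, False))  # only the load and the branch survive
```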
Another complication is that there is now potentially a queue of writes to memory that are all waiting. Each one can only be finally written to memory when the instruction that produced it retires; otherwise it would overwrite what was already in memory, and there would be no way to undo the write. A further complication is that writes to memory usually need to happen in program order, so one write that has not yet retired can block later ones that are otherwise ready.
And as if that wasn't enough, loads need to peek in the queue of writes to memory to make sure they get the correct value, because it might not yet be in memory.
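This peeking is usually called store-to-load forwarding, and it can be sketched as a search of the pending-store queue, newest first, before falling back to memory. The data structures here are my own invention for illustration:

```python
# Sketch of store-to-load forwarding: a load checks the queue of pending
# (not-yet-retired) stores, newest first, before reading memory.
def forwarded_load(address, store_queue, memory):
    """store_queue: list of (address, value) pairs, oldest first."""
    for addr, value in reversed(store_queue):  # newest matching store wins
        if addr == address:
            return value
    return memory[address]  # no pending store to this address

memory = {0x10: 1}
store_queue = [(0x10, 2), (0x10, 3)]  # two pending stores, same address
print(forwarded_load(0x10, store_queue, memory))  # 3, not the 1 in memory
```

Without forwarding, the load would read the stale value 1 from memory even though the program has already (speculatively) stored 2 and then 3 to that address.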
There has been a lot of work on making better and better branch predictors, and their accuracy can reach the high 90s percent. But that is still a prediction, and there will always be at least the occasional incorrect one. The work done in error will need to be unwound.
Programmers write code in a high-level language such as C or Python. Occasionally they write instructions directly, known as assembly language. But even if they use the high-level language, the compiler turns that into a sequence of instructions to be executed in order. The programmer has a simplistic view of the world that all the instructions execute in the order of the instruction sequence (obviously following branches as appropriate).
It's not obvious from what I've described in the processor architecture sections above, but even with everything executing out of order, and with branch prediction sometimes succeeding and sometimes failing, and instructions being retired seemingly randomly, it works. That is, it will indeed appear to the programmer as if the instructions executed in the order in which they appeared in the instruction sequence. It is this illusion that has meant that programmers think hardware details are grubby, and something they don't need to concern themselves with: the hardware people have done such a good job that programmers can get away with that attitude.
It's a bit like your bank. To you, there is your pile of money, and you can do stuff to it like take it out at an ATM, and your salary goes on top of the pile a couple of times a month. You don't need to know about fractional reserve banking. In fact, your money has already been lent to someone else, and when you show up at the ATM, they find some money to give you. But it feels as if it is the same money just coming off your pile.
A couple of years ago, a number of people started to think about whether there were ways that programmers could actually discover whether, say, a branch prediction had been correct, or whether a value was in the cache or not. These are known as side channels, since the "main" channel is the instruction sequence, and the hardware guarantees that the programmer can never see instructions being executed out of order (even though they are); they always appear to have been executed in order.
These were what led to the Spectre and Meltdown vulnerabilities, announced to the world in January of last year. Jon is the guy at Red Hat tasked with working out how to mitigate the effect of these vulnerabilities, which was the real focus of his presentation. But I'll cover that in another post.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.