As part of the RISC-V workshop, Dave Patterson gave a talk on computer architecture. I split it into two pieces. The first appeared yesterday as Fifty Years of Computer Architecture: The First 20 Years.
Our intrepid hero, Dave Patterson, is an untenured professor on sabbatical at DEC, where he has had his paper rejected. But the real world is not so easily rejected. If the microprocessor people imitated the mainframe people and built complex, microcoded instruction sets, then bugs in that microcode would need to be fixed in the field.
Dave’s key realization was to get rid of microcode, which had originally been motivated by performance tradeoffs that no longer held. Instead of microcode, just run the code directly out of a fast on-chip instruction cache, and stick to simple instructions. The implementation could use a hardware pipeline. It would then be small enough that a 32-bit microprocessor and a small cache would fit on a single chip, unlike the microcoded approach Intel’s Oregon group was pursuing, which required several chips. Signals would not need to cross chip boundaries, so it would run at high speed.
This was the reduced instruction set computer, or RISC. In 1982, students built the first RISC chip. It was 45,000 transistors. The RISC II, a better design, was built in 3µm NMOS, ran at 3MHz, and measured 60mm².
Dave made what he regards as a big mistake. Instead of calling the next machine RISC III, it was called SOAR (for Smalltalk on a RISC), and then the next was called SPUR. When Dave told Krste his regrets, Krste said “Great, we’ll call the next one RISC-V.”
For a time there was heated debate about CISC versus RISC. Given that Intel sells 350M x86 chips per year and dominates servers and desktops, it looks like CISC won. But under the hood, RISC has really won. Intel is stuck with its x86 CISC ISA, but on the silicon the instruction decoder breaks those complex instructions down into simple operations for the execution units, and from there any RISC ideas can be used.
The Intel story isn’t over. Having shut down the i432 project in Oregon, they started a new one (also in Oregon). Intel had a business issue: in the early days of semiconductors, customers wouldn’t trust a single supplier for a complex component like a microprocessor; they required a second source, another company that could manufacture the same component. When I was at VLSI, for example, we were a second source for the Hitachi H8 microprocessor. Once SoCs and on-chip processors came along, the whole idea of second sources went away, but from the early days AMD had been a second source for Intel’s x86 microprocessor line. The business issue for Intel was that AMD was being far too successful selling x86 chips and eating into what Intel considered its rightful business. But a big transition was coming up: microprocessors would move to 64 bits.
Intel decided that if they changed the ISA, then they would no longer need to license it to AMD. They went with an approach called VLIW, which stands for very long instruction word, perhaps 500 bits, although Intel called it EPIC, for explicitly parallel instruction computing. The idea was to keep the hardware simple and put all the dependencies in the compiler. Intel’s product was called Itanium.
At first, everyone believed the hype. HP helped develop the architecture and committed its servers to Itanium, and SGI abandoned MIPS for Itanium.
But it turned out that there were way too many unused resources: instruction slots empty and execution units idle. VLIW only works well if most of the execution units are kept busy most of the time. It works quite well for very specialized loops doing intense processing, such as DSP (Tensilica is a VLIW architecture), but for general-purpose computing it is not so effective. Branch prediction didn’t work well, cache misses were unpredictable, code size was enormous, and the compilers turned out to be impossible to write. Itanium got nicknamed Itanic after the Titanic movie that had just come out.
In the meantime, AMD extended the 32-bit ISA to 64 bits in a natural way. Customers loved it and it became obvious to Intel that they had to have microprocessors using that ISA. That is why cloud datacenters today are full of microprocessors manufactured by Intel, but they all implement the AMD 64-bit ISA. Itanium was quietly left to sink.
But the golden age of Moore’s Law and Dennard scaling ended 10-15 years ago. The world went multi-core to escape the power issues. But that always runs into an observation Gene Amdahl made 50 years ago this year, now known as Amdahl’s law: the fraction of code that cannot be parallelized will almost always limit the possible speedup to a small number. Meanwhile, single-thread performance is now increasing at only about 3% per year, so maybe it will double in 20 years. There just are not a lot of architectural tricks left.
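Amdahl's observation is easy to check numerically. A minimal sketch (the function name and the sample parallel fractions are my illustration, not from the talk):

```python
# Amdahl's law: if a fraction p of a program can be parallelized,
# the best possible speedup on n cores is 1 / ((1 - p) + p / n).
def amdahl_speedup(p, n):
    """Upper bound on overall speedup for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the code parallelized, the 5% serial kernel dominates:
print(round(amdahl_speedup(0.95, 64), 1))     # 64 cores give about 15.4x
print(round(amdahl_speedup(0.95, 10**9), 1))  # "infinite" cores cap at 20.0x
```

The serial fraction (1 − p) sets a hard ceiling of 1/(1 − p) no matter how many cores are added, which is exactly the limit on multi-core that Dave was pointing to.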
As I wrote about recently in Are General-Purpose Microprocessors Over?, what’s left is specialized processors, also called domain-specific architectures. For example, Google’s TPU has 30-80 times better performance per watt. It has 65,000 MACs on it, so that is for sure a different architecture.
So this is a second golden age of architecture like in the early RISC days, when things are moving fast. “The semiconductor guys have tossed it to the architects.”
Areas ripe for improvement are VR, human interface, vision, etc. We don’t want to build hardware for the wrong algorithm, but there are huge gains available for important algorithms. The future of architecture will be more heterogeneous. “It is now conventional wisdom and I share that,” Dave said.
On the other hand, there is surprisingly widespread agreement on instruction sets for general-purpose computing, so we don’t need lots of them, just one reasonably good one. The OS community largely uses Linux. Compilers have standardized on LLVM. We need to do the same around a single ISA, and that is the potential brass ring for RISC-V.
Another change is the concept of a base instruction set plus extensions. For the 50 years of ISA history, ISAs have all been monolithic. ARM, for example, decided to start big: you can’t subset it, so you have to deliver everything, the way everyone has done it since the early IBM 360 days over 50 years ago. But on a custom chip there is no need to put everything in every time. “It’s like dim sum, you don’t have to buy everything they make.” There can be special instructions for application areas. Floating point is well understood, so it makes sense to standardize that now. For neural networks it is too early; everything is changing so fast that it might be another five years before there is any sort of consensus. In the past, nobody worried about this, and ISAs didn’t deliberately set aside room for additional instructions. RISC-V left a lot.
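The modular approach shows up directly in how RISC-V cores are named: a base like RV32I or RV64I plus single-letter standard extensions such as M (multiply/divide), A (atomics), F/D (floating point), and C (compressed instructions). A toy parser sketch of that naming scheme (the function is my illustration, and it ignores the multi-character extension names that real ISA strings also allow):

```python
def parse_riscv_isa(name):
    """Split an ISA string like 'RV64IMAFDC' into (width, base, extensions)."""
    s = name.upper()
    assert s.startswith("RV"), "RISC-V ISA strings start with 'RV'"
    i = 2
    while i < len(s) and s[i].isdigit():   # consume the register width digits
        i += 1
    width = int(s[2:i])                    # 32 or 64
    base = s[i]                            # base integer ISA, e.g. 'I'
    extensions = list(s[i + 1:])           # optional single-letter extensions
    return width, base, extensions

print(parse_riscv_isa("RV64IMAFDC"))  # (64, 'I', ['M', 'A', 'F', 'D', 'C'])
```

The point of the dim sum analogy is visible in the data structure: the base is mandatory, everything after it is a menu choice.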
As to the basic computing substrate, silicon, there doesn’t seem to be anything new on the horizon. Quantum computing is deep physics, and the near-term hope is to do a toy problem that nobody cares about faster than any general-purpose computer can. It will be amazing if in 10 years they can do the same calculations you can do on your phone.
RISC is the consensus today. There hasn’t been a new CISC in 30 years, and VLIW has remained only in specialized niches. Computers have gotten way, way faster, partially due to Moore's Law but also due to architectural innovation. A VAX would take 10 clock cycles per instruction; modern microprocessors are measured in instructions per clock cycle.
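To put rough numbers on that comparison (the clock rates and per-cycle figures below are my illustrative assumptions, not from the talk):

```python
def throughput_mips(clock_hz, ipc):
    """Instruction throughput in MIPS = clock rate * instructions per cycle."""
    return clock_hz * ipc / 1e6

# A VAX-class machine: ~5 MHz clock at ~10 cycles per instruction (IPC = 0.1).
vax = throughput_mips(5e6, 0.1)    # about 0.5 MIPS
# A modern superscalar core: ~3 GHz retiring ~4 instructions per cycle.
modern = throughput_mips(3e9, 4)   # 12,000 MIPS
print(f"{modern / vax:,.0f}x")     # roughly 24,000x in raw throughput
```

Only part of that gap is clock speed; the shift from cycles-per-instruction to instructions-per-cycle is the architectural contribution.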
But the basic insight that Dave had on his sabbatical nearly 40 years ago has endured. Or as Dave put it himself, "Who'd have thought what I worked on as an untenured professor would still be going strong today."