
Paul McLellan

Sophie Wilson: The 2020 Wheeler Lecture (Multicore to Today)

11 Jun 2020 • 7 minute read


This is the second post, continuing from yesterday's Sophie Wilson: The 2020 Wheeler Lecture (The 6502 to Multicore), covering Sophie Wilson's Wheeler Lecture given at Cambridge in the middle of May.

Here is the linkfest of the most recent posts I've written covering the evolution of microprocessor architecture:

  • Domain-Specific Computing 1: The Dark Ages of Computer Architecture
  • Domain-Specific Computing 2: The End of the Dark Ages
  • Domain-Specific Computing 3: Specialized Processors
  • Dark Silicon: Not a Character from Star Wars
  • Fifty Years of Computer Architecture: The First 20 Years
  • Fifty Years of Computer Architecture: The Last 30 Years

The Multicore Consensus

Traditional scalar computation (on a single processor) has not increased much in performance since 2006, although it has become more power-efficient. Scalar programming languages are a poor fit for modern hardware, so we need a revolution in software. The problem is that more transistors, on their own, just aren't useful for making a single processor faster.

scaling of processor performance

Sophie showed this graph, which I've used before (I think the original is in Hennessy and Patterson's book Computer Architecture). It shows that when we were just scaling transistors, scalar performance improved by about 25% per year. Then, in the big era of Dennard scaling (rapidly increasing clock rates) and RISC, we got 50% per year. With the switch to OoO we were back to 25% per year, then with Amdahl's Law just 12% per year. Now we are down to 3.5% per year, the end of "easy Moore", or what I've also heard called "happy scaling".
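To put those growth rates in perspective, here is a quick back-of-the-envelope calculation (using the percentages quoted above as round numbers, not exact figures from Hennessy and Patterson) of how long single-thread performance takes to double at each rate:

```python
import math

# Doubling time implied by an annual growth rate r: ln(2) / ln(1 + r).
# The rates below are the ones quoted from the graph above.
for label, rate in [("transistor scaling era", 0.25),
                    ("Dennard/RISC era", 0.50),
                    ("OoO era", 0.25),
                    ("Amdahl era", 0.12),
                    ("today", 0.035)]:
    years_to_double = math.log(2) / math.log(1 + rate)
    print(f"{label}: ~{years_to_double:.1f} years to double performance")
```

At 50% per year, performance doubles roughly every 21 months; at 3.5% per year, it takes about 20 years.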

Going to multicore doesn't make a lot of difference. Going from 1 core to 2 is noticeable, but going to 4 not so much. "You could buy something in 2015 and it is still state-of-the-art performance today... painful."
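That diminishing return is Amdahl's Law at work. A minimal sketch, assuming (purely for illustration, not a figure from the lecture) that 75% of the workload parallelizes:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: overall speedup when only part of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Speedup over a single core for 2, 4, and 8 cores at 75% parallel fraction.
for n in (1, 2, 4, 8):
    print(f"{n} cores: {amdahl_speedup(0.75, n):.2f}x")
# 1 -> 1.00x, 2 -> 1.60x, 4 -> 2.29x, 8 -> 2.91x
```

The second core buys a 1.6x speedup, but eight cores only reach about 2.9x, which is why adding cores stops being noticeable so quickly.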

Power Issues

Here is Pat Gelsinger's (then Intel CTO) famous graph showing power density in watts per square centimeter alongside some other heat sources. "With enough Pentium Pros arrayed together, you could fry an egg." The problem is that the power used is not decreasing as fast as the transistor size, so the power density (W/cm²) is increasing. It is power that constrains our future.

The implication is that an increasing amount of the silicon must remain dark (not powered up), even at desktop power levels of, say, 125W. It is even worse when we stack die in 3D, since the power density goes up even further.
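Here is a rough sketch of why silicon must go dark. The die area and active power density below are assumptions chosen only to illustrate the arithmetic, not numbers from the lecture:

```python
# Illustrative-only: if fully active logic would dissipate more than the
# package can handle, only a fraction of the die can be switched on at once.
die_area_mm2 = 600.0             # large desktop-class die (assumed)
active_density_w_per_mm2 = 0.5   # power density if everything ran flat out (assumed)
power_budget_w = 125.0           # the desktop power level mentioned above

power_if_fully_lit = die_area_mm2 * active_density_w_per_mm2   # 300 W
usable_fraction = min(1.0, power_budget_w / power_if_fully_lit)
print(f"Fraction of the die that can be active at once: {usable_fraction:.0%}")  # ~42%
```

With those assumptions, less than half the die can be active at once, and stacking more silicon into the same package only shrinks that fraction.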

Economic Issues

Sophie went on to the claim that transistors now cost more than they did at 28nm. That is a sort of end of Moore's Law, since only some things will be worth the greater expense of process geometries below 28nm. I'm not completely sure that this is true, since all the manufacturers of these processes say that transistor cost continues to fall once a process is fully ramped. Of course, you can take the view that "they would say that", and exact wafer prices are a closely guarded secret. Anyway, I won't argue with Sophie's view that mobile, desktop, laptop, and server chips all run in enough volume and gain enough performance/power advantage to justify the cost, and that there are also lots of things that can't afford it. There is certainly a huge fixed cost associated with a very advanced design, in the form of NRE for mask costs and the overall cost of doing the design.

Another economic problem is that it now takes 18 times as many scientists as it did in the 1970s to keep Moore's Law on track, so each unit of research effort is 18 times less effective at generating economic value.

This combination of fewer products that can take advantage of advanced fabs, and the dramatically higher cost of developing a process and building a fab, means that today there are just three companies left at the leading edge. When Sophie built FirePath (130nm), there were 22 companies capable of manufacturing it. Now there are four manufacturing at 16/14nm and only three doing anything smaller.

In the early days, steppers were the size of a large office photocopier. They got larger and more expensive until we got to EUV. Sophie explained how EUV works, including the light source, but I've discussed that enough to leave that bit out.

The conclusion from everything she presented to this point is that leading designs are OoO processors doing six operations per cycle at peak. They are not very energy efficient but can be made to run fast (4GHz+). Gains beyond six ops per peak cycle are very limited and hard to attain. Almost all the communication in a superscalar OoO processor is local, so it can be made to run fast. But the bypass network that connects results back to the inputs is quadratically expensive, so going to seven ops per cycle means going from 36 to 49 routes from outputs to inputs. There is no easy way out of this hole.
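The quadratic cost is easy to see: each of the N result buses has to be forwardable to each of the N issue slots, so the route count grows as N². A tiny illustration:

```python
# Bypass (forwarding) network routes grow quadratically with issue width.
for width in range(4, 9):
    print(f"{width}-wide issue: {width * width} forwarding routes")
# 6-wide -> 36 routes, 7-wide -> 49 routes, as noted above
```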

Here are the state-of-the-art processors in 10nm. Note the blue rectangle at the back, which is Intel's Skylake. Intel continues to focus on scalar performance but pays the cost in power and area compared to the various Arm-derived processors.

Unconventional Processors

The alternative is what Sophie calls "unconventional designs" using VLIW. For the applications for which they are designed, these tend to be very power efficient. Cadence's Tensilica even made a brief appearance.

So we are now in a transition from general-purpose processors to somewhat specialized DSPs, GPUs, and now deep learning engines. For example, the Google TPU has 64K 8-bit integer multipliers...and you can use more than one TPU in parallel. You can get 3-5 TOPS/W. For more about the TPU, see my post Inside Google's TPU.
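As a sanity check on those numbers, here is the peak-throughput arithmetic for the first-generation TPU, using the 65,536 MACs and 700MHz clock from Google's published TPU paper (treat the result as approximate):

```python
# Back-of-the-envelope peak throughput for the first-generation TPU.
macs = 64 * 1024      # 65,536 8-bit multiply-accumulate units
clock_hz = 700e6      # 700 MHz clock
ops_per_mac = 2       # one multiply plus one add per cycle
peak_tops = macs * clock_hz * ops_per_mac / 1e12
print(f"Peak throughput: ~{peak_tops:.0f} TOPS")   # ~92 TOPS
```

At a power budget in the tens of watts, that peak figure is consistent with the few-TOPS/W range quoted above.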

Changing How We Package

Another alternative is advanced packaging where, for example, AMD has a big die on 14nm and the smaller dies on 7nm (on the right above). Intel, with Foveros, is doing the same and mixing geometries (on the left). For more about that, see my posts System in Package, Why Now? and System in Package, Why Now? Part 2.

A further possible development is to cool the chips directly. Today we cool the external package. But combining the two approaches, 3D packaging and direct liquid cooling, remains a big problem.

Or how about wafer-scale integration? Sophie wrapped up with a quick look at the Cerebras "chip", which is actually the largest rectangle you can manufacture on a 12" wafer. I wrote about that in The Biggest Chip in the World.

Q & A

Q: Do you think security has been prioritized sufficiently through history?

It depends if you mean successful processors, since they have all ignored security in favor of performance. Computer science, especially outside of the US, has focused more on security, but it didn't catch on, since the people who didn't care about security made something that was faster.

Q: Do children have enough exposure to the underlying hardware?

The most alarming statistic is the rate at which computer architects are dying and not being replaced by younger ones. There are areas where these things are simply not taught. For schoolchildren, it is nice to have an appreciation that there is understandable stuff under there showing how the world is run. But in the Western world, there is a lot of ignorance about how things are actually accomplished.

Q: New smaller lithography is always expensive initially. Is it permanently more expensive?

Costs do come down a bit. At first, the fab itself isn't getting very good results with a new process and yield is quite low. It used to be the case in the 28/65nm days that 15-20 masks were enough. At 10nm, designed without EUV, it is about 100 masks. That translates into cost in obvious ways: it takes longer and it costs more, and that's not going away. On materials: copper wires are sort of waveguides. They are so small that if we use cobalt or ruthenium instead, which don't have the same skin effects, they become relatively better conductors. So we have to use exotic materials, but then we have to protect the other parts of the design against those materials.

Q: Two quick questions. Any signs of an alternative to silicon?

Maybe a few years ago I'd have said quantum devices are not using silicon, but now most quantum research is trying to get it into silicon, since silicon manufacturing is so advanced. Quantum has gone more silicony. There is a little evidence of other materials: Intel has phase-change materials in Optane, for example. But silicon remains where all the research money is being spent.

Q: Last question. Will we ever see another significant increase in perceived performance? Will it come from hardware, software, quantum?

We have seen a significant change in special-purpose processors. Things have rocketed ahead, most obviously in deep learning: you can talk to your iPad and expect it to transcribe your speech, which was impossible with a normal processor. Your iPad, locally, can be pointed at some text, do character recognition, translate it with Google Translate, and render it back into characters. In general-purpose areas, I'm quite pessimistic.

Watch the Lecture

The lecture was recorded and you should be able to view it.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.