Paul McLellan
Tags: 57dac, DAC, TSMC, Design Automation Conference

DAC 2020: TSMC Keynote

31 Jul 2020 • 6 minute read

The opening keynote at DAC was given by TSMC's Chief Scientist Philip Wong. That's clearly not nearly enough work, so he's a professor at Stanford, too.

As I said in my post DAC Preview 2020, I've seen a version of this presentation before when Philip gave one of the keynotes at last summer's HOT CHIPS.

Philip was introduced by DAC Chair, Cadence's Zhuo Li.

Nothing Is 5nm on a 5nm Chip

TSMC's Philip Wong complains in his abstract that:

Since its inception, the semiconductor industry has used a physical dimension (minimum gate length of a transistor) as a means to gauge continuous technology advancement. This metric is all but obsolete today. Density is what drives the benefits of new device technologies for computation—the primary application driver for semiconductors. 

Over four years ago, less than a month after I rejoined Cadence and Breakfast Bytes began, I wrote a post Where Does 5 Really Mean 30? Process Node Naming where I complained about the same thing. Actually, not quite the same thing. Philip Wong was more focused on what we should actually use as measures, and less on what we should call them. But what we call them is important, too.

Here's the key fact, if you don't know it: there is nothing that is 5nm on a 5nm chip.

Philip Wong's Keynote

Philip's title slide also credited Kerem Akarvardar of TSMC.

Philip started by focusing on data movement in deep learning processors:

Data movement today is the key problem, since data movement is very expensive in both energy and latency.
...
More and more energy is wasted in data access and a small portion is consumed in the compute circuitry. Caching becomes less and less effective for these computing workloads.
...
While this is a challenge, it is also an opportunity for massive gains if we can focus on developing technologies with system performance in mind.

Next, a short history of semiconductors since 1970, and the importance of scaling. Until about 2000 or so we had Dennard scaling, which I've also heard called "happy scaling". You got more transistors, they were faster, and they were lower power, so in aggregate we could increase the clock rate at constant power. Then for much of a decade, we carried on as if Dennard scaling still worked. We got the transistors, and the increased performance, but we also got more power, until finally we reached the thermal limit and had to cap clock rates. Next, we did channel geometry scaling with high strain and then FinFET. In the last few years, we've had to go to Design Technology Co-Optimization (DTCO), where we add special features to the process that allow us to make a one-time gain by knocking a track out of the track height of our standard cells.

Through almost the entire period (perhaps until DTCO) it didn't matter what we looked at, we got almost exactly the same scaling, on the same exponential (the slope in the above diagram). A single inverter, SRAM, logic areas, microprocessor transistor density: all gave the same answer.
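For readers who want the underlying relations, here is the textbook form of Dennard (constant-field) scaling, my own summary rather than a slide from the talk. Scale every linear dimension and the supply voltage by 1/κ and you get:

```latex
% Classic constant-field (Dennard) scaling, scale factor \kappa > 1 (textbook summary)
\begin{aligned}
L,\;W,\;t_{\mathrm{ox}},\;V_{dd} &\;\rightarrow\; 1/\kappa && \text{(dimensions and voltage)}\\
\text{transistor density} &\;\rightarrow\; \kappa^{2} && \text{(each device takes } 1/\kappa^{2}\text{ the area)}\\
\text{gate delay} \propto CV/I &\;\rightarrow\; 1/\kappa && \text{(so clock frequency can rise by } \kappa\text{)}\\
\text{power per device} \propto CV^{2}f &\;\rightarrow\; 1/\kappa^{2} &&\\
\text{power density} &\;\rightarrow\; \text{constant} &&
\end{aligned}
```

That combination is why it was "happy scaling": density up by κ², speed up by κ, power density flat. Once supply voltage stopped scaling, the CV²f term no longer fell, which is the thermal wall described above.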

Next, a thought experiment. What if we somehow got much faster transistors (as we have done) but without the density scaling? How good would that be? Not very:

  • Not enough memory
  • No multi-core
  • No accelerators
  • Wire delay too long and energy consumption too high (too far between transistors)

Looking at how to improve performance further, and continue to extend Moore's Law for system performance, the above "roofline plot" shows what happens as we increase the performance we demand (x-axis) and see what we actually get (y-axis). The name "roofline plot" obviously comes from the shape of the graph. When we don't demand much performance, we are limited by memory bandwidth. If we have lots of memory bandwidth, we cap out running all the CPUs at full power (basically a power/thermal issue). You can see the formulas for memory bandwidth and processor peak throughput. The easiest way to increase memory bandwidth is to increase the bus width (because we already have that technology so we know how to do it). And the easiest way to increase processor throughput is to add more cores (again, because we know how to do it). The other options, increasing clock (or data) frequency, or increasing ops/cycle, are pretty much capped out.
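To make the shape concrete, here is a minimal sketch of the standard roofline model, my own illustration with made-up numbers rather than Philip's slide: attainable throughput is the minimum of the compute roof (roughly cores × clock × ops/cycle) and the memory-bandwidth slope (roughly bus width × data rate × operational intensity).

```python
# Minimal roofline sketch (my own illustration, hypothetical numbers).
# Attainable throughput = min(compute roof, bandwidth * operational intensity).

def roofline(op_intensity_ops_per_byte, peak_ops_per_s, bandwidth_bytes_per_s):
    """Return attainable ops/s for a workload with the given operational intensity."""
    bandwidth_bound = bandwidth_bytes_per_s * op_intensity_ops_per_byte
    return min(bandwidth_bound, peak_ops_per_s)

PEAK = 100e12   # 100 Tops/s peak compute (hypothetical: cores * clock * ops/cycle)
BW = 1e12       # 1 TB/s memory bandwidth (hypothetical: bus width * data rate)

for oi in (1, 10, 100, 1000):   # ops performed per byte moved
    print(f"{oi:>5} ops/byte -> {roofline(oi, PEAK, BW) / 1e12:6.1f} Tops/s")
# Below PEAK/BW = 100 ops/byte we sit on the sloped, bandwidth-limited part of
# the roof; above it we are flat against the compute roof.
```

Widening the bus raises the sloped part of the roof; adding cores raises the flat part, which is exactly the pair of levers described above.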

So to improve system throughput, we need more transistors and more memory, all combined in a package.

First, focus on processor throughput. It is almost entirely due to the number of cores. Clock frequency is a contributor, of course, but doesn't go up much, whereas the cores roughly double each processor generation in a GPU or a deep learning engine. The number of cores is pretty much proportional to the number of transistors on the chip, and so has increased at 1.68X every two years for the last 15 years. In the last decade, when DTCO has been used extensively, actual transistor densities have increased even faster than Moore's Law, but this is not captured by conventional metrics such as contacted poly pitch × metal pitch (CPP×MP), the most basic measure of how densely we can pack transistors.
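As a quick sanity check on that growth rate (my own arithmetic, not a number from the talk), 1.68X every two years compounds to roughly 50X over 15 years:

```python
# Back-of-the-envelope check of the quoted growth rate (my arithmetic).
growth_per_two_years = 1.68
years = 15
total_growth = growth_per_two_years ** (years / 2)
print(f"~{total_growth:.0f}x transistor-count growth over {years} years")  # ~49x
```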

Next, memory.

As shown in this transistor count versus memory capacity plot [above], there is a close-to-1:1 correlation between the number of transistors and the amount of memory for all computing systems, from mobile, to desktop, to server, to even the largest supercomputers in the world. It's a correlation over 8 orders of magnitude; it's quite amazing.

But that is just capacity. We have a bandwidth deficit, too. We can't really improve data rates due to power, but we can increase data widths. But, as Philip said:

this requires innovation in compute-memory integration

The above diagram shows three primary ways to integrate memory and logic:

  • Traditional 2D: Memory in DIMMs connected to the processor on a circuit board
  • 2.5D with HBM memory stacks alongside the processor (how modern GPUs and many processors are built)
  • 3D, with HBM memory stacked on top of the processor die and connected through the stack with TSVs

Still higher levels of integration can be obtained with monolithic integration, the N3XT (pronounced "next") system. As he explained:

"3" stands for 3D and I believe this is the kind of system you will see in the future. The vision is to have multiple layers of logic and memory. The memory layers span the gamut from high speed memory to high capacity memory.
...
In the last decade with existing 3D technologies, we've seen four orders of magnitude increase in I/O density. But if we go monolithic, the interlevel vias can be 100nm or below. So there's at least another four orders of magnitude to go as far as bandwidth improvement is concerned.

The roofline model shows us that three metrics are important for a semiconductor technology. We call these D_L, D_M, and D_C. These are logic density, memory density, and interconnect density between logic and memory. Units are numbers/mm². Today's public information shows these numbers to be about 100M, 200M, and 12,000 per mm². These three numbers will capture the most important attributes of a semiconductor process going forward.
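To connect this to the earlier quote about monolithic integration (my own back-of-the-envelope, not a figure from the talk): at a 100nm interlevel-via pitch, the achievable connection density is on the order of 10^8 per mm², roughly four orders of magnitude above the ~12,000/mm² D_C figure quoted above, which is where the "at least another four orders of magnitude" of bandwidth headroom comes from if the per-wire data rate stays fixed.

```python
# Back-of-the-envelope (mine): connection density at a 100 nm via pitch
# versus the ~12,000/mm^2 logic-memory interconnect density quoted above.
via_pitch_mm = 100e-9 * 1e3             # 100 nm expressed in mm (1e-4 mm)
vias_per_mm2 = (1 / via_pitch_mm) ** 2  # ~1e8 connections per mm^2
today_dc = 12_000                       # ~today's D_C per mm^2
print(f"monolithic: ~{vias_per_mm2:.0e}/mm^2, "
      f"headroom: ~{vias_per_mm2 / today_dc:.0f}x")  # ~8,300x, i.e. ~4 orders of magnitude
```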

Since he was at DAC, Philip wrapped up with a plea for tools to optimize and partition across die, not just within die. This will unleash innovation and democratize design, making it more like the software ecosystem, unlike today, when only a few companies can afford to create systems at the most advanced nodes.

Ideally, it should be as easy to innovate in hardware as it is to write a piece of software code.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.