Paul McLellan

HOT CHIPS: The Big Trends

17 Sep 2021 • 5 minute read

Two big trends come as no surprise, since they were big at HOT CHIPS last year and at other events such as the Linley Processor Conferences: every processor is an AI processor, and increasingly most processors use some form of advanced packaging.

These two trends were easy to read off the agenda. HOT CHIPS always starts with a full day of tutorials on the Sunday preceding the conference proper, covering two major topics with multiple presenters. This year the two topics were:

Tutorial 1: ML Performance and Real World Applications: Machine learning is a rich, varied, and rapidly evolving field. This tutorial will explore the applications, performance characteristics, and key challenges of many different unique workloads across training and inference. In particular, we will focus on hardware/software co-optimization for the industry-standard MLPerf benchmarks and selected applications and considerations at prominent cloud players.

Tutorial 2: Advanced Packaging: This tutorial will discuss advanced 3D packaging technologies that enable performance and density improvements. Descriptions of the technologies and how they are used in cutting-edge applications will be made by industry leaders in packaging and chip design.

So I'm not exactly going out on a limb picking these two topics as the big trends that will drive the design of very high-performance systems for the foreseeable future. Obviously, the specialized AI processors count, but GPUs have been used for AI for a long time, and general-purpose CPUs, such as the x86 parts and IBM's z-class mainframes, all have AI acceleration in the system. Not every high-end chip uses 3D packaging techniques, but many do; that is the second trend. The most dramatic example is Intel's Ponte Vecchio (see my post HOT CHIPS: Two Big Beasts), which consists of 47 tiles manufactured in five different process nodes, not all of them Intel's. I don't know if this counts as the most complex 3D-IC design ever done, but I don't remember seeing anything close.

I will pick a couple more trends, though. First, power. Actually, not so much power delivery and analysis as thermal constraints and analysis. Those 3D packages have complex thermal analysis associated with them: once you put one chiplet/tile on top of another, you have to worry about getting the heat out of the stack. Plus, most high-performance designs have their clock rate constrained by thermal considerations more than by critical-path timing.
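Why would heat, rather than timing, set the clock? Here is a minimal sketch using the textbook dynamic-power relation P ≈ C·V²·f. The capacitance, voltage, power budget, and critical-path delay below are all made-up but plausible assumptions of mine, not figures from any HOT CHIPS talk:

```python
# Toy estimate of a thermally limited clock rate. This is a sketch of
# the standard dynamic-power relation, not a real power model; every
# number below is an illustrative assumption.

C_EFF = 5e-9      # effective switched capacitance per cycle (F), assumed
V_DD = 0.8        # supply voltage (V), assumed
P_BUDGET = 15.0   # power the package can dissipate (W), assumed
T_CRIT = 150e-12  # critical-path delay (s), assumed

# Dynamic power is roughly P = C * V^2 * f, so the thermal ceiling is:
f_thermal = P_BUDGET / (C_EFF * V_DD**2)   # ~4.7 GHz with these numbers
f_timing = 1.0 / T_CRIT                    # ~6.7 GHz from timing alone

f_clock = min(f_thermal, f_timing)
print(f"thermal limit {f_thermal/1e9:.1f} GHz, "
      f"timing limit {f_timing/1e9:.1f} GHz, "
      f"clock {f_clock/1e9:.1f} GHz")
# With these assumptions, the heat budget, not the critical path,
# is what sets the clock.
```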

A second trend is that logic continues to increase in performance at something close to Moore's Law rates, while memories improve in performance only very slowly. In Dave Ditzel's presentation (or perhaps when I talked to him the week before) he mentioned a paper that I didn't know about: Ten Lessons From Three Generations Shaped Google's TPUv4i (IEEE membership may be required), written by a dozen people from Google. The point Dave had picked up on was that inference chips must be air-cooled since they need worldwide deployment. If you expect people to drop a PCIe card into a million servers in existing data centers, that's just not going to happen if you need to add water cooling to each one. This does not apply to training, since that already requires several adjacent dedicated racks.

But for this post, the important thing that stood out to me was lesson 1:

Logic, wires, SRAM, and DRAM improve unequally. Horowitz’s insights on operation energy inspired many DSA designs. This was recently updated to 7 nm, showing an average gain of 2.6X from 45 nm, but the change is uneven:

  • SRAM access improved only 1.3X–2.4X, in part because SRAM density is scaling slower than in the past. Comparing 65 nm to 7 nm, SRAM capacity per mm2 is ~5X less dense than ideal scaling would suggest.
  • DRAM access improved 6.3X due to packaging innovations. High Bandwidth Memory (HBM) places short stacks of DRAM dies close to DSAs over wide buses.
  • Energy per unit length of wire improved <2X. Poor wire delay scaling led TPUv2/v3 to use 2 smaller cores from 1 larger core on TPUv1. Logic improves much faster than wires and SRAM, so logic is relatively “free.” HBM is more energy-efficient than GDDR6 or DDR DRAM. HBM also has the lowest cost per GB/s of bandwidth.
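To get a feel for what that unequal scaling means for a designer, here is a back-of-the-envelope tally in the spirit of Horowitz's energy-per-operation numbers. The specific picojoule figures are rough, order-of-magnitude assumptions of mine, not values from the TPUv4i paper:

```python
# Rough energy cost of compute versus data movement. The per-operation
# energies are illustrative, order-of-magnitude assumptions in the
# spirit of Horowitz's numbers, not figures from the TPUv4i paper.

PJ = 1e-12
E_MAC = 1.0 * PJ        # assumed: one 16-bit multiply-accumulate in logic
E_SRAM = 20.0 * PJ      # assumed: one 64B on-chip SRAM access
E_DRAM = 1000.0 * PJ    # assumed: one 64B off-chip DRAM/HBM access

for name, e in [("MAC", E_MAC), ("SRAM access", E_SRAM), ("DRAM access", E_DRAM)]:
    print(f"{name:12s} ~{e / E_MAC:5.0f}x the energy of a MAC")

# One DRAM access costs as much energy as ~1000 MACs, so keeping data
# on chip and close to the multipliers matters far more than making
# the multipliers themselves faster: logic really is relatively "free".
```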

Most of the processors (and I use the word in the most general sense, to include CPUs, GPUs, TPUs, and other more esoteric systems) are designed more to optimize memory flow than to optimize the actual computation. That is especially true of the AI processors, which are all large arrays of multipliers. In fact, if you think about the big processors for data centers, all the out-of-order instruction processing, the branch prediction, and the caches go together to do their best to hide the fact that processor logic has gotten a lot faster while DRAM has stayed at roughly the same performance, with the focus instead on increasing capacity. The mismatch is huge: IBM announced a 5GHz mainframe, so that has a CPU cycle time of 200ps, but a DRAM access is more like 100ns, around 500 CPU cycles. So one lesson is that the design of these big systems is increasingly about memory architecture and how data is accessed and flows through the chips.
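To put numbers on that mismatch, here is a quick sketch. The cycle time and DRAM latency come straight from the paragraph above; the memory bandwidth and cache-line size are round-number assumptions of mine:

```python
# How big is the gap, and how much concurrency does it take to hide it?
# Cycle time and DRAM latency are from the text; bandwidth and line
# size are assumed, round numbers.

CYCLE = 200e-12      # 5 GHz CPU cycle time (s)
DRAM_LAT = 100e-9    # DRAM access latency (s)
LINE = 64            # bytes fetched per request, assumed
BW = 100e9           # sustained memory bandwidth (B/s), assumed

print(f"one DRAM access = {DRAM_LAT / CYCLE:.0f} CPU cycles")  # ~500

# Little's law: requests in flight = bandwidth * latency / bytes per request
in_flight = BW * DRAM_LAT / LINE
print(f"~{in_flight:.0f} cache lines must be in flight to saturate DRAM")

# Out-of-order windows, prefetchers, many threads, and caches all exist
# to create (or avoid the need for) this much memory-level parallelism.
```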

Just one data point: Microsoft's presentation during the Sunday tutorials was titled ZeRO-Infinity and DeepSpeed: Breaking the Device Memory Wall for Extreme Scale Deep Learning.

I mentioned thermal above. All the big designs have thermal challenges (there's a reason the conference is called HOT CHIPS...those chips are hot). But increasingly, the performance of cores on a chip is varied for thermal reasons. The focus of Intel's Thread Director technology (see my post HOT CHIPS: The Next-Generation of General-Purpose Compute) is on getting good single-thread performance when required and good overall throughput when it is not, but one of the big things Thread Director seems to do is monitor the cores for thermal issues and (in tandem with the operating system) move computation around the system to keep it manageably cool.
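As a cartoon of the idea (and it is only a cartoon, nothing like Intel's actual implementation), a thermally aware scheduler just needs a temperature reading per core and a rule for where to put the next thread:

```python
# Toy thermally aware thread placement. This is a minimal sketch of the
# general idea, not how Intel's Thread Director actually works.

from dataclasses import dataclass

@dataclass
class Core:
    core_id: int
    temp_c: float      # die-temperature sensor reading, assumed available

THROTTLE_AT = 90.0     # assumed thermal threshold in Celsius

def place_thread(cores: list[Core]) -> Core:
    """Pick the coolest core, preferring ones below the throttle threshold."""
    eligible = [c for c in cores if c.temp_c < THROTTLE_AT] or cores
    return min(eligible, key=lambda c: c.temp_c)

cores = [Core(0, 93.0), Core(1, 71.5), Core(2, 88.0), Core(3, 65.0)]
print(f"schedule next thread on core {place_thread(cores).core_id}")  # core 3
```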

Summary

So there are my big trends:

  • AI in everything
  • Advanced 3D packaging
  • Increased focus on data movement and memory
  • Thermal issues

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
