Paul McLellan
Ten Lessons from Three Generations of Google TPUs

20 Sep 2021 • 7 minute read

In my post about the big trends from this year's HOT CHIPS, I mentioned a paper from ISCA written by a group of people from Google. The paper is not actually titled the same as this blog post; its real title is Ten Lessons From Three Generations Shaped Google's TPUv4i (I think you need to be an IEEE member to read it for free). Some of the lessons are specific to AI training and inference, or even specific to Google. But most of the lessons are more general and apply to anyone designing systems for deployment at scale, especially over multiple generations. So clearly Cadence's own hardware system families, Palladium and Protium, qualify. If you are designing an SoC for mass deployment, your design qualifies, too. So I thought I'd look at the ten lessons and generalize them a bit.

The abstract tersely summarizes the ten lessons:

Google deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSA); target total cost of ownership vs initial cost; support multi-tenancy; deep neural networks (DNN) grow 1.5X annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.

[Image: Google TPUs v1, v2, and v3]

For this post, I'm going to assume you know the background on the Google family of TPUs. I wrote about the first three versions of the design back in 2018 in my posts about Cliff Young's Linley keynote, Inside Google's TPU and Google TPU Software. Above is an image of all three of them: TPUv1 on the top left (air-cooled), TPUv2 on the top right (air-cooled), and TPUv3 at the bottom (water-cooled).

So, here is the top-ten list of lessons from these processors, although unlike a David Letterman Top Ten List this one goes from 1 to 10. The first lesson is one I already mentioned in my review of the HOT CHIPS conference, HOT CHIPS: The Big Trends.

Lesson ① Logic, Wires, SRAM, and DRAM Improve Unequally

Horowitz’s insights on operation energy inspired many DSA designs. Those numbers were recently updated to 7nm, showing an average gain of 2.6X from 45nm, but the change is uneven, as the rough sketch after this list illustrates:

  • SRAM access improved only 1.3X–2.4X, in part because SRAM density is scaling more slowly than in the past. Comparing 65nm to 7nm, SRAM capacity per mm² is ~5X lower than ideal scaling would suggest.
  • DRAM access improved 6.3X due to packaging innovations. High Bandwidth Memory (HBM) places short stacks of DRAM dies close to DSAs over wide buses.
  • Energy per unit length of wire improved by less than 2X. Poor wire delay scaling is what led TPUv2/v3 to use two smaller cores instead of the one larger core of TPUv1. Logic improves much faster than wires and SRAM, so logic is relatively “free.” HBM is more energy-efficient than GDDR6 or DDR DRAM, and it also has the lowest cost per GB/s of bandwidth.
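To make the "logic is free" point concrete, here is a rough back-of-the-envelope sketch in Python. The SRAM, DRAM, and wire factors come from the list above; the logic factor and the equal 45nm energy split are purely illustrative assumptions, not figures from the paper.

```python
# Illustrative only: how an operation's energy breakdown shifts when logic,
# SRAM, DRAM, and wires improve by different factors from 45nm to 7nm.
baseline_45nm = {"logic": 1.0, "sram": 1.0, "dram": 1.0, "wires": 1.0}  # arbitrary units
improvement = {
    "logic": 4.0,   # assumption: logic improves much faster than the 2.6X average
    "sram": 1.8,    # roughly the middle of the 1.3X-2.4X range above
    "dram": 6.3,    # packaging (HBM) innovations
    "wires": 1.9,   # <2X
}

energy_7nm = {k: v / improvement[k] for k, v in baseline_45nm.items()}
total = sum(energy_7nm.values())
for part, e in sorted(energy_7nm.items(), key=lambda kv: -kv[1]):
    print(f"{part:5s}: {e:.2f} units ({100 * e / total:.0f}% of the 7nm budget)")
# SRAM and wires now dominate the budget; shrinking logic alone buys little.
```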

Lesson ② Leverage Prior Compiler Optimizations

Since the 1980s, the fortunes of a new architecture have been bound to the quality of its compilers. Indeed, compiler problems likely sank the Itanium's VLIW architecture. Yet many DSAs rely on VLIW, including TPUs (and Tensilica processors). Architects hope that great compilers will be developed on simulators, yet much of the progress happens only after hardware is available, since that is when compiler writers can measure actual execution times. Reaching an architecture's full potential quickly is much easier if the compiler can leverage prior optimizations rather than starting over from scratch. Note, in particular, the conclusion that compiler compatibility is much more important than binary compatibility.

Lesson ③ Design for Performance per TCO vs per CapEx

TCO is the total cost of ownership: the original cost of the system (amortized over the three or so years the system has value), plus all the other costs, such as electricity and cooling, over those three years.
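As a hedged illustration (all numbers below are invented, not from the paper), here is a small Python sketch of why performance per TCO and performance per CapEx can rank two systems differently:

```python
# Hypothetical comparison of two accelerators: the one that looks better on
# perf/CapEx can lose on perf/TCO once three years of power and cooling are
# included. Prices, power draws, and rates are made-up assumptions.

def perf_per_tco(perf, capex, watts, years=3.0,
                 dollars_per_kwh=0.10, cooling_overhead=1.5):
    """Performance per total cost of ownership (purchase + energy + cooling)."""
    hours = years * 365 * 24
    opex = (watts / 1000.0) * hours * dollars_per_kwh * cooling_overhead
    return perf / (capex + opex)

cheap_but_hot  = perf_per_tco(perf=100, capex=2000, watts=450)
pricier_cooler = perf_per_tco(perf=100, capex=2500, watts=200)

print(f"cheap but hot  : {100/2000:.3f} perf/$ CapEx, {cheap_but_hot:.4f} perf/$ TCO")
print(f"pricier, cooler: {100/2500:.3f} perf/$ CapEx, {pricier_cooler:.4f} perf/$ TCO")
# CapEx alone favors the cheaper system; TCO favors the more efficient one.
```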

Lesson ④ Support Backwards ML Compatibility

This principle applies to almost any processor, not just machine learning accelerators: the new generation should run existing software loads unchanged (and get the same results!). That might not be the best way to use it, but it is a good place to start. This is especially easy in machine learning, since frameworks like TensorFlow and PyTorch operate at a level much further removed from the hardware than, say, a C compiler.

Lesson ⑤ Inference DSAs Need Air Cooling for Global Scale

The 75W TPUv1 and 280W TPUv2 were air-cooled, but the 450W TPUv3 uses liquid cooling. Liquid cooling requires placing TPUs in several adjacent racks to amortize the cooling infrastructure. That placement restriction is not a problem for training supercomputers, which already consist of several adjacent racks. This was the point that Dave Ditzel took from this paper, and it is one reason that Esperanto tried so hard to get the power down so that water cooling would not be necessary. By the way, Palladium is available in both water-cooled and air-cooled versions. Like a training supercomputer, it already spans several racks, so water cooling can make sense: it is much more efficient to pump water to the roof than to blow air through the unit, heat up the data center, and then use conventional air conditioning to get the heat to the roof anyway. However, many data centers are simply not equipped for water cooling. This might surprise gamers, since any serious gamer has an overclocked water-cooled rig at home.

Lesson ⑥ Some Inference Apps Need Floating-Point Arithmetic

Tensilica processors have had an optional floating-point unit almost from the beginning. Just recently, we announced a family of floating-point-optimized processors. See my post Tensilica Floating-Point DSP Family.

Lesson ⑦ Production Inference Normally Needs Multi-Tenancy

Among other things, this is just good software engineering practice. Palladium and Protium both support multi-tenancy, allowing multiple users to make use of the system at the same time (subject, as always, to the constraint that you can't emulate or prototype more gates at once than the system can support overall).
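Here is a toy sketch of that capacity constraint. It is purely illustrative, not how Palladium or Protium actually schedule jobs, and the 10B-gate capacity is an invented number:

```python
# Toy admission check for sharing a fixed-capacity emulation system among
# several tenants: a job is accepted only if the total mapped gates still fit.
CAPACITY_GATES = 10_000_000_000  # hypothetical capacity, not a real spec

active_jobs = {}  # job name -> gates in use

def try_admit(name, gates):
    """Admit the job only if the machine still has room for it."""
    if sum(active_jobs.values()) + gates > CAPACITY_GATES:
        return False  # would exceed what the system can map at once
    active_jobs[name] = gates
    return True

print(try_admit("soc_regression", 6_000_000_000))    # True
print(try_admit("gpu_subsystem", 3_000_000_000))     # True
print(try_admit("another_full_soc", 4_000_000_000))  # False: over capacity
```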

Lesson ⑧ DNNs Grow ~1.5X/Year in Memory and Compute

I've actually seen much bigger numbers than this. Most graphs I have seen show the number of weights in neural networks growing much faster than Moore's Law. For example, here's a graph from Linley Gwennap's keynote in May that I showed in my post Linley: Driving AI from the Cloud to the Edge. It shows many models growing as much as 40X per year.
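To put those growth rates side by side, here is a quick compounding calculation in Python (the ~1.4X/year figure for silicon is my rough stand-in for recent process scaling, not a number from the paper or the keynote):

```python
# Compound a few growth rates over five years to show how quickly DNN memory
# and compute requirements can outrun process scaling.
years = 5
rates = [
    ("TPU paper: DNNs grow 1.5X/year", 1.5),
    ("largest models (Linley keynote): 40X/year", 40.0),
    ("silicon scaling, rough assumption: 1.4X/year", 1.4),
]
for label, rate in rates:
    print(f"{label:45s} -> {rate ** years:,.0f}X over {years} years")
```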

Lesson ⑨ DNN Workloads Evolve with DNN Breakthroughs

There have been lots of changes in models over the last few years. I think one of the most significant for edge inference devices like Tensilica processors is the amount of optimization that can be done using sparsity and quantization without losing any accuracy. You would think that using 8-bit arithmetic instead of 32-bit floating point would make a big difference to accuracy, but a neural network can compensate by adding more weights, and fewer of them than you might expect. Even 1-bit models can be effective!
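As a minimal sketch of what weight quantization looks like (symmetric, per-tensor, 8-bit; real toolchains use per-channel scales, calibration, and sparsity, and this is not the Tensilica flow):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: int8 weights plus one float scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
max_err = np.abs(dequantize(q, scale) - w).max()
print(f"weights are 4X smaller; max absolute rounding error = {max_err:.5f}")
```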

Lesson ⑩ Inference SLO Limit is P99 Latency, Not Batch Size

SLO is a service-level objective, such as identifying X images per second. P99 latency is the latency that is met 99% of the time, i.e., how quickly images are (almost) always identified. The batch size is how the data, in this case images, is aggregated so that the device can stream it through (identify all the images on my laptop) or not (identify each image on a phone as the picture is taken). This is a bit too technical for this post, so I'll let you read the paper if you are interested. I would say that for inference at the edge, where Tensilica is used, there is usually little flexibility in batch size (it's not as if you can queue up all your commands for your smart speaker and say them all at once!). Google's TPUs are used for inference in its cloud data centers, so not at the edge.
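To show what trading batch size against a P99 latency target looks like, here is a toy Python simulation (the latency model, the 10ms SLO, and all the constants are invented for illustration):

```python
import numpy as np

# Toy model: fixed overhead + per-image cost + a long-ish random tail.
# Pick the largest batch size whose P99 latency still meets the SLO.
rng = np.random.default_rng(1)

def simulate_latencies_ms(batch, n=10_000):
    return 2.0 + 0.5 * batch + rng.gamma(shape=2.0, scale=1.0, size=n)

SLO_MS = 10.0
for batch in (1, 2, 4, 8):
    p99 = np.percentile(simulate_latencies_ms(batch), 99)
    verdict = "meets" if p99 <= SLO_MS else "misses"
    print(f"batch {batch}: P99 = {p99:5.1f} ms -> {verdict} the {SLO_MS:.0f} ms SLO")
# Bigger batches improve throughput but push P99 past the latency SLO.
```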

Learn More

Now you know how to put the "ten" in "Tensilica"! I wrote recently about the whole family in my post On-Device Artificial Intelligence the Tensilica Way.

The Tensilica product page has details on the whole family. If you want all the details of Google's full paper, I linked to it at the start of the post.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
