At the recent HOT CHIPS, the Sunday morning tutorial was on scale out of deep learning training. I covered the introduction in my post HOT CHIPS: Scaling out Deep Learning Training. The second half of the morning was devoted to some of the biggest scale-out processors around.
As Sameer Kumar and Dehao Chen from Google described it in their presentation, it is a space race for the biggest ML machine. The ones they called out were:
Microsoft didn't present in this session of HOT CHIPS (they did present the latest Xbox in the main conference). But Cerebras did. So there were three presentations: NVIDIA on DGX A100 SuperPOD, Google on the TPU Pod, and Cerebras on the generically titled Cerebras System (CS-1) which is based around their wafer-scale chip.
For more details on the Google TPU, see my post Inside Google's TPU. For more on Cerebras's "chip" see my post from last year HOT CHIPS: The Biggest Chip in the World, which is about the chip and how it was constructed, or Weekend Update 2, which contains more about the system that Cerebras actually sells, the CS-1.
Michael Houston of NVIDIA presented this truly enormous machine. It is #1 on MLPerf for commercially available systems (which is cheating a little, since you can't buy a Google TPU, which was #1 on MLPerf overall). It is #7 on the TOP500 list of supercomputers with 27.6 PetaFLOPS on HPL, and #2 on the Green500 (for energy efficiency) at 20.5 GigaFLOPS per Watt. It is the fastest industrial system in the US, at over an exaFLOPS for AI training.
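As a sanity check on those numbers, the Green500 efficiency figure together with the HPL score implies the machine's power draw. This is purely a back-of-the-envelope calculation from the figures quoted above, not an official NVIDIA number:

```python
# Back-of-the-envelope check using the numbers from the post;
# the power draw is the derived (not quoted) value.
hpl_pflops = 27.6            # HPL performance, PetaFLOPS (TOP500 #7)
efficiency_gflops_w = 20.5   # Green500 efficiency, GigaFLOPS per Watt

# Power = performance / efficiency.
# 1 PetaFLOPS = 1e6 GigaFLOPS, and 1e6 Watts = 1 MW, so the factors cancel.
power_megawatts = (hpl_pflops * 1e6) / efficiency_gflops_w / 1e6

print(f"Implied power draw: ~{power_megawatts:.2f} MW")  # roughly 1.35 MW
```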
Perhaps more surprising, given that supercomputers normally take years to build, it was built with the NVIDIA DGX SuperPOD architecture in just three weeks, during lockdown. Under the covers, it has:
Diving a bit deeper, here's the spec for a single DGX A100 that is the main building block of this huge machine:
One of the challenges with building big systems like this is that they don't last forever before they need to be expanded. The trick is to build them in a modular way so that they can grow without taking the system down: add more cabinets and plug them in with wires (or fiber-optic cables). The diagram below shows the layout of a module. The compute units are the five greyish columns on the left, with the networking in the third rack. The management and fabric connecting to the rest of the data center are in the seventh rack, and storage is in the eighth.
Each SuperPOD is 140 systems and was built in less than 10 days. It was designed to be deployable during quarantine by two engineers installing 20 systems per shift. The maximum they achieved was 60 systems in one two-shift day, and that was the limit only because they couldn't get more trucks into the loading dock. The cables were all built off-site, like giant automotive wiring harnesses. The procedure was: rack, connectivity check, automatic provisioning, burn-in, identify issues, fix, handoff. The average time from racked to user running was four hours.
The emphasis of Google's presentation by Sameer Kumar and Dehao Chen was on the TPU-v3 multipods with 4096 TPU-v3 chips. You've probably seen pictures of these before, but in case not, here's what they look like. On the left is the board, which you can see contains 4 TPUs and is water-cooled (that's what the colored pipes are). On the right is a multipod. It delivers 100+ PetaFLOPS, has 32TB of HBM, and has a 2-D toroidal network (one dimension running vertically within a rack, the other running horizontally across the whole pod). There are 1024 chips in this pod.
But they have gone further and linked up four of these pods to form an even larger pod. Despite the picture below, they don't stack them on top of each other. This has 4096 TPU chips, delivers over 400 PetaFLOPS, and has a 128x32 mesh network topology (instead of toroidal).
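To make the torus-versus-mesh distinction concrete, here is a small, purely illustrative Python sketch (my own, not Google's interconnect code) of how many neighbors a chip has under each wiring scheme. In a torus the links wrap around at the edges; in a mesh they don't:

```python
# Hypothetical illustration: neighbor sets for a chip at (x, y)
# in a w x h grid, under torus vs. mesh wiring.

def torus_neighbors(x, y, w, h):
    """In a 2-D torus, edges wrap around, so every chip has exactly 4 neighbors."""
    return [((x - 1) % w, y), ((x + 1) % w, y),
            (x, (y - 1) % h), (x, (y + 1) % h)]

def mesh_neighbors(x, y, w, h):
    """In a 2-D mesh there is no wraparound, so edge and corner chips have fewer links."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(nx, ny) for nx, ny in candidates if 0 <= nx < w and 0 <= ny < h]

# A corner chip of a 128x32 mesh has only 2 links; in a torus it would have 4.
print(len(mesh_neighbors(0, 0, 128, 32)))   # 2
print(len(torus_neighbors(0, 0, 128, 32)))  # 4
```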
When they scale training onto these systems, they try to maximize vertical communication (within a pod) and minimize horizontal communication (between pods) because of the performance difference. They went into more detail about how they scale training, the main takeaway being that they automated a lot of it in the compiler, but it is hard anyway. The results speak for themselves: this machine won the latest MLPerf benchmark round, announced just recently.
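As a toy illustration of why intra-pod traffic is preferred, here is a rough cost model (entirely my own sketch with made-up bandwidth numbers, not anything from Google's compiler) comparing a flat ring all-reduce that is bottlenecked by the slow inter-pod links against a hierarchical one that reduces within each pod first and only then across pods:

```python
# Toy cost model with assumed (hypothetical) link bandwidths, GB/s.
INTRA_GBPS = 600.0   # assumed fast intra-pod (vertical) links
INTER_GBPS = 25.0    # assumed slower inter-pod (horizontal) links

def flat_allreduce_time(gbytes, n_chips):
    # A ring all-reduce moves ~2*(n-1)/n of the data per chip,
    # and the flat ring is bottlenecked by its slowest (inter-pod) link.
    return 2 * (n_chips - 1) / n_chips * gbytes / INTER_GBPS

def hierarchical_allreduce_time(gbytes, chips_per_pod, n_pods):
    # Phase 1: reduce within each pod over the fast links.
    intra = 2 * (chips_per_pod - 1) / chips_per_pod * gbytes / INTRA_GBPS
    # Phase 2: reduce across the (few) pods over the slow links.
    inter = 2 * (n_pods - 1) / n_pods * gbytes / INTER_GBPS
    return intra + inter

# 1 GB of gradients across 4096 chips arranged as 4 pods of 1024.
flat = flat_allreduce_time(1.0, 4096)
hier = hierarchical_allreduce_time(1.0, 1024, 4)
print(f"flat: {flat*1e3:.1f} ms, hierarchical: {hier*1e3:.1f} ms")
```

With these assumed numbers the hierarchical schedule wins because the cross-pod phase only has to span 4 endpoints instead of 4096; the real compiler-driven scheduling is of course far more sophisticated.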
Natalia Vassilieva of Cerebras started by introducing the chip itself. The CS-1 is built around the wafer-scale engine, which is the largest square die you can fit on a 300mm wafer. There's a picture above, along with some statistics, if you've never seen it before. The chip goes into a module (on the right) that they call the engine block. That ends up in the system on the left once you add all the cooling and power supplies.
I think that Cerebras's focus is on running training on a single system, although in the Q&A they did say that when the model gets too big they scale out, like everyone else. Here's what that looks like: