Paul McLellan

NOT CHIPS: Tesla's Project Dojo

8 Sep 2021 • 5 minute read

Nope, NOT CHIPS is not a misprint. Tesla didn't present at HOT CHIPS, but they did hold an AI Day just a couple of days later, and what their team presented was pretty mind-blowing. There was a lot about self-driving software, training, and generally about how camera-only self-driving is being implemented. The scale of training required, especially when everything had to be retrained to get rid of radar, is enormous. If that's what you are interested in, then you should watch the whole video.

But I'm going to pretend Tesla presented this at HOT CHIPS and focus on Project Dojo, which is Tesla's massive scale training infrastructure. This section was presented by Ganesh Venkataramanan, who leads Project Dojo.

Project Dojo

Project Dojo is built and optimized at multiple levels, with a chip as the foundation. These chips are then assembled into computation arrays, powered, and cooled. That is then taken to the cluster level, which is really a whole data center. On top sits a software stack that makes it all work. I'm going to focus on the hardware in this post, although, like any modern hardware, how well it functions in practice is also a function of its software stack. We have the same challenge at Cadence, since Palladium, Protium, and even Tensilica processor IP only perform as well as the software that feeds the hardware.

The D1 Chip

The lowest-level block, a "training node" in Tesla-speak, is a 64-bit superscalar CPU optimized around matrix-multiply units and vector SIMD. It supports FP32, BFLOAT16, and a new format, CFP8 (configurable FP8). It is backed by 1.25 MB of fast ECC-protected SRAM and a low-latency, high-bandwidth fabric. It delivers 1 teraflop of compute (BF16 and CFP8) or 500 gigaflops (FP32), and it has a custom ISA optimized for machine learning (ML).

Each training node has 512 GB/s of bandwidth in all four cardinal directions, meaning nodes can be abutted in any direction to scale out into much bigger compute planes. In fact, 354 of them are abutted into what Tesla calls a "compute array", capable of delivering 362 TFLOPS. At the edge of the chip are 576 112G SerDes, giving 4 TB/s per edge.
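As a quick sanity check of those figures (my own back-of-the-envelope arithmetic, not something from the presentation, and the per-node figure of ~1.024 TFLOPS is an assumption consistent with the rounded "1 teraflop" claim above):

# Rough sanity check of the training-node and D1 numbers (Python)
NODE_TFLOPS_BF16 = 1.024   # assumed per-node BF16/CFP8 throughput (~"1 teraflop")
NODES_PER_D1 = 354         # training nodes abutted into one compute array

d1_tflops = NODE_TFLOPS_BF16 * NODES_PER_D1
print(f"D1 compute array: ~{d1_tflops:.0f} TFLOPS")   # ~362 TFLOPS, matching Tesla's number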

This compute array is the D1 chip, manufactured in 7nm technology. It is 645 mm² with 50 billion transistors and over 11 miles of wire (only car companies think in terms of miles!). It dissipates 400W and comes in a flip-chip BGA package. Here's one.

Here's why Tesla is so proud of this chip. In the chart, I/O bandwidth is on the vertical scale and teraflops of compute on the horizontal scale. D1 outperforms Google's TPU3, GPUs, and deep-learning chips from startups; it is the chip way up and to the right.

Compute Tile

Since D1 chips can connect without any glue, Tesla just started putting them together. How about putting together 500,000 training nodes? That is about 1,500 chips seamlessly connected to each other.

There are then Dojo interface processors to allow seamless connectivity to standard data center host processors.

To build this compute plane, Tesla had to come up with a new way of putting the chips together, creating what it calls a "training tile", the unit of scale for the system. The above image shows 25 known-good die (KGD) integrated onto a fan-out wafer process that preserves the bandwidth between adjacent D1 chips. At the edge, Tesla created a connector that also preserves the bandwidth. This training tile delivers 9 petaflops of compute and 36 TB/s of off-tile bandwidth. It is (perhaps) the biggest organic MCM in the chip industry.
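The tile-level figure follows directly from the per-chip number; here is the same kind of rough check (my arithmetic, not Tesla's):

D1_TFLOPS = 362          # per D1 chip, BF16/CFP8
CHIPS_PER_TILE = 25      # known-good die on the fan-out wafer

tile_pflops = D1_TFLOPS * CHIPS_PER_TILE / 1000
print(f"Training tile: ~{tile_pflops:.1f} PFLOPS")   # ~9.1 PFLOPS, quoted as 9 petaflops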

The next challenge was how to power this tile. They created a custom voltage-regulator module that could be re-flowed, PCB-style, directly onto the fan-out wafer.

But more is required. This shows how a fully integrated, powered, and cooled training tile is implemented. The video has a wonderful animation of how it all goes together. The tile takes 52V DC and draws 18,000 amps, dissipating 15 kW of heat. The key thing is that the compute plane is orthogonal to the power supply and cooling. The only thing I've seen that compares to this is the support infrastructure required to power and cool the Cerebras wafer-scale chip.
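One way to reconcile the 52V input with the 18,000-amp figure (my interpretation; the presentation doesn't spell this out) is that the amps are delivered at core voltage by the on-tile regulators, not at the 52V input:

TILE_POWER_W = 15_000    # quoted tile dissipation, watts
INPUT_VOLTS = 52         # DC input to the tile
CORE_AMPS = 18_000       # quoted current draw

print(f"Current at the 52V input: ~{TILE_POWER_W / INPUT_VOLTS:.0f} A")   # ~288 A
print(f"Implied core-rail voltage: ~{TILE_POWER_W / CORE_AMPS:.2f} V")    # ~0.83 V, a plausible core rail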

So this is a 9 petaflop training tile in less than one cubic foot. But this is just the unit of scale for bigger systems.

 And here is one for real.

Compute Cluster

To build a compute cluster, Tesla just "tiled together some tiles". Each tray holds a 2x3 array of tiles, and each cabinet holds two trays, which gives 100+ petaflops per cabinet.

Next, they:

broke the cabinet walls and integrated seamlessly all the way through, preserving the bandwidth

That got them to an ExaPOD. It contains 120 training tiles in 10 cabinets: more than 1 million training nodes and 1.1 exaflops (at 16-bit precision). For comparison, Fugaku, currently the #1 supercomputer in the world (see my post Japanese Arm-Powered Supercomputer Takes the TOP500 Crown), is "only" 415 petaflops (although at greater than 16-bit precision, for sure). Of course, Fugaku exists, and Dojo won't exist until early 2022. But it will be up there among the most powerful supercomputers.
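The cluster-level numbers also line up with the per-tile figures (again my arithmetic, and the 12-tiles-per-cabinet layout is as described above):

TILE_PFLOPS = 9
TILES_PER_CABINET = 12          # 2x3 tiles per tray, 2 trays per cabinet
CABINETS_PER_EXAPOD = 10
NODES_PER_TILE = 25 * 354       # 25 D1 chips x 354 training nodes each

tiles = TILES_PER_CABINET * CABINETS_PER_EXAPOD
print(f"Cabinet: ~{TILE_PFLOPS * TILES_PER_CABINET} PFLOPS")               # ~108 PFLOPS ("100+")
print(f"ExaPOD: {tiles} tiles, ~{TILE_PFLOPS * tiles / 1000:.2f} EFLOPS")  # ~1.08 EFLOPS
print(f"ExaPOD training nodes: {NODES_PER_TILE * tiles:,}")                # 1,062,000 (>1 million)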

Ganesh went on to talk about software, which is clearly important, but as I said at the start, I'm only looking at the hardware. Here is the software stack anyway. If you have any interest in learning more, I recommend watching the video. I'll just wrap up with one anecdote: in the Q&A afterwards, Elon Musk was asked what the standard for success was. He said it was when the software guys didn't want to use the large GPU-based cluster anymore (this is the one I covered in Tesla Goes All-In on Vision...and Supercomputers).

The Video

Here is the whole video. You can skip to 47m where the actual presentation begins. The part on Project Dojo starts at 1h 45m.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.
