On April 22, Tesla held its Autonomy Day, where it announced its "Self-Driving Computer" or SDC. (You can read what I wrote back then in my post Tesla Drives into Chip Design.) I have said several times over the years that I expected the high-end automobile manufacturers would need to follow the high-end smartphone manufacturers and design their own chips, so I felt vindicated when Tesla announced that it had done just that. They call it FSD, for Full Self-Driving Computer.
At HOT CHIPS recently, Debjit Das Sarma and Ganesh Venkataramanan presented Compute and Redundancy Solution for the Full Self-Driving Computer. Some of what they said covered material in that earlier post, so I'll try not to duplicate it. (That's a polite way of saying that I'm going to assume you read my earlier post.) All the images in this post are from their HOT CHIPS presentation.
Let's start with their goals, which were:
That led to goals for the FSD chip itself:
The new board contains dual redundant identical SoCs (in blue and turquoise in the picture). There are also dual redundant power supplies. In the Q&A, they said that the two chips could either be run redundantly or separately. They were designed for both modes of operation.
The form factor of the board and the connectors are backward compatible with the previous Autopilot computer. There are overlapping camera fields with redundant paths, as you can see on the right of the image.
The computer takes in information from sensors:
Here's the chip. It is manufactured in Austin in Samsung's 14nm FinFET process. It is 260mm² with 6 billion transistors. It is already in production. In fact, on Autonomy Day, Tesla announced that it had already been shipping in several models for a few weeks. It is packaged in a 37.5mm square flip-chip BGA.
The chip contains a Tesla-designed neural network accelerator, along with third-party IP for the CPU, GPU, ISP, H.265 video encoder, memory controller, PHYs, on-chip NoC, and peripherals.
The chip contains two copies of the neural network accelerator, each of which runs at 2+GHz. Each has a 96x96 array of MACs, giving 36.8 TOPS per accelerator, along with hardware SIMD, ReLU, and pooling units. There is 32MB of SRAM per instance. It is all 8-bit precision, with no floating point in the MAC array.
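The 36.8 TOPS figure falls straight out of the numbers above. Here is a quick sanity check, assuming the clock is exactly 2GHz (the talk only said "2+GHz") and using the usual convention that one MAC counts as two operations (a multiply and an add). The constant and function names are mine, not Tesla's.

```python
# Back-of-the-envelope peak throughput for one neural network accelerator.
MAC_ROWS = 96
MAC_COLS = 96
CLOCK_HZ = 2.0e9   # assumption: exactly 2 GHz; the talk only says "2+GHz"
OPS_PER_MAC = 2    # convention: one multiply plus one add per MAC
ACCELERATORS_PER_CHIP = 2


def peak_tops(rows, cols, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak tera-operations per second if every MAC is busy every cycle."""
    macs = rows * cols
    return macs * ops_per_mac * clock_hz / 1e12


per_accelerator = peak_tops(MAC_ROWS, MAC_COLS, CLOCK_HZ)
per_chip = per_accelerator * ACCELERATORS_PER_CHIP

print(f"{MAC_ROWS * MAC_COLS} MACs per accelerator")
print(f"{per_accelerator:.3f} TOPS per accelerator")
print(f"{per_chip:.3f} TOPS per chip")
```

That gives 9,216 MACs and about 36.86 TOPS per accelerator, which matches the quoted 36.8 TOPS. These are peak numbers, of course; real utilization depends on how well a given network maps onto the array.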
A single convolution is a 7-deep nested for-loop. It turns out that 99.7% of operations are MACs. However, speeding up just the MACs by orders of magnitude makes quantization/pooling more performance sensitive (a sort of Amdahl's law for neural networks). So they have dedicated quantization and pooling hardware, too, to speed things up all around.
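To make the "7-deep nested for-loop" concrete, here is a minimal pure-Python sketch of a naive convolution (stride 1, no padding). The loop ordering and the names are my own illustration, not taken from the presentation. I've also added the textbook Amdahl's law formula, treating the 99.7% share of operations as if it were a share of run time, which is an assumption on my part.

```python
def conv2d_naive(inputs, weights):
    """Naive convolution, stride 1, no padding.

    inputs[n][c][y][x], weights[k][c][r][s] -> outputs[n][k][oy][ox].
    Returns the outputs and the number of multiply-accumulates performed.
    """
    n_images, in_ch = len(inputs), len(inputs[0])
    in_h, in_w = len(inputs[0][0]), len(inputs[0][0][0])
    out_ch = len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    outputs = [[[[0] * out_w for _ in range(out_h)]
                for _ in range(out_ch)] for _ in range(n_images)]
    mac_count = 0
    for n in range(n_images):                     # 1: image in the batch
        for k in range(out_ch):                   # 2: output channel
            for oy in range(out_h):               # 3: output row
                for ox in range(out_w):           # 4: output column
                    acc = 0
                    for c in range(in_ch):        # 5: input channel
                        for r in range(k_h):      # 6: kernel row
                            for s in range(k_w):  # 7: kernel column
                                acc += inputs[n][c][oy + r][ox + s] * weights[k][c][r][s]
                                mac_count += 1
                    outputs[n][k][oy][ox] = acc
    return outputs, mac_count


def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the work gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


image = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]   # 1 image, 1 channel, 3x3
kernel = [[[[1, 1], [1, 1]]]]                   # 1 output channel, 2x2 of ones
result, macs = conv2d_naive(image, kernel)
print(result[0][0], macs)
print(amdahl_speedup(0.997, 1000.0))
```

The innermost statement is the MAC, which is why MACs dominate the operation count. Under the assumption above, making the MACs 1000X faster only buys about 250X overall, and even infinitely fast MACs top out at roughly 333X, because the remaining 0.3% starts to dominate. That is the argument for dedicated quantization and pooling hardware.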
To keep power down, there is flexible state-machine-based control logic with built-in loop constructs. This eliminates DRAM reads and writes, and minimizes SRAM reads. It is a single clock domain with DVFS-enabled power and clock distribution.
The instruction set is shown in the above diagram. Note that this is not a sample of instructions; it is the complete instruction set. It provides just what is required, but with the flexibility to change the algorithms. There is limited out-of-order execution since DMA read, DMA write, and compute can take place simultaneously. There is no data movement in the grid (it is not a systolic array).
The microarchitecture is shown in the above diagram.
The programmable SIMD unit is shown. It has signed and unsigned int and FP32 arithmetic, with predication support for all instructions. There is a pipelined implementation of quantization that fuses ReLU, scale, and normalization layers. There is full SIMD program support with functions such as Argmax, Exponential, Sigmoid, and Tanh.
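To give a feel for what "fusing" ReLU, scale, and normalization into the quantization step means, here is a minimal sketch. The exact arithmetic Tesla uses was not given in the talk, so everything here is an assumption: I fold scale and normalization into a single multiply-add, apply ReLU, then round and clamp to an unsigned 8-bit range. The function name `fused_requantize` and its parameters are hypothetical.

```python
def fused_requantize(accumulators, scale, bias, out_min=0, out_max=255):
    """Turn wide accumulator values into 8-bit outputs in a single pass.

    Assumption: scale and normalization have been folded into one
    multiply-add (scale, bias), so no intermediate tensor is written
    out between the layers.
    """
    outputs = []
    for acc in accumulators:
        y = acc * scale + bias    # folded scale + normalization
        y = max(y, 0.0)           # ReLU
        q = int(round(y))         # quantize to the nearest integer
        outputs.append(min(max(q, out_min), out_max))  # saturate to 8 bits
    return outputs


quantized = fused_requantize([-400, 0, 1000, 100000], scale=0.01, bias=1.2)
print(quantized)
```

The point of doing this in one pipelined step is that the wide intermediate values never have to be written back to SRAM as separate layers; only the final 8-bit results are.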
The result is that the performance is 21X what they achieved on the previous HW 2.5 generation, at a 1.25X increase in power to 72W (for two of them), and at 80% of the cost of the prior generation. Only 20% of the power (15W) is in the neural network accelerator.
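It is worth turning those ratios into a few derived numbers. This is just arithmetic on the figures quoted above; the variable names are mine, and I'm assuming the 1.25X and 72W figures refer to the same board-level measurement.

```python
# Figures quoted in the talk.
PERF_RATIO = 21.0        # performance vs. the HW 2.5 generation
POWER_RATIO = 1.25       # power vs. the HW 2.5 generation
NEW_POWER_W = 72.0       # new board power (two chips)
COST_RATIO = 0.80        # cost vs. the HW 2.5 generation
NNA_POWER_SHARE = 0.20   # share of power in the neural network accelerators

# Derived values.
perf_per_watt_gain = PERF_RATIO / POWER_RATIO   # improvement in performance per watt
perf_per_cost_gain = PERF_RATIO / COST_RATIO    # improvement in performance per dollar
prev_power_w = NEW_POWER_W / POWER_RATIO        # implied power of the previous generation
nna_power_w = NNA_POWER_SHARE * NEW_POWER_W     # implied accelerator power

print(f"performance per watt: {perf_per_watt_gain:.1f}X")
print(f"performance per unit cost: {perf_per_cost_gain:.2f}X")
print(f"implied previous power: {prev_power_w:.1f}W")
print(f"implied accelerator power: {nna_power_w:.1f}W")
```

So performance per watt improved by about 16.8X and performance per unit cost by about 26X. The implied previous-generation power is about 57.6W, and 20% of 72W is about 14.4W, which is presumably what the talk rounded to 15W.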
Their summary was that: