HOT CHIPS: The Tesla Full Self-Driving Computer

3 Sep 2019 • 4 minute read

On April 22, Tesla held its Autonomy Day, where it announced its "Self-Driving Computer" or SDC. (I covered it at the time in my post Tesla Drives into Chip Design.) I have said several times over the years that I expected high-end automobile manufacturers to follow high-end smartphone manufacturers and design their own chips, so I felt vindicated when Tesla announced that it had done just that. They call it FSD for Full Self-Driving Computer.

At HOT CHIPS recently, Debjit Das Sarma and Ganesh Venkataramanan presented Compute and Redundancy Solution for the Full Self-Driving Computer. Some of what they said covered material in that earlier post, so I'll try not to duplicate it. (That's a polite way of saying that I'm going to assume you read my earlier post.) All the images in this post are from their HOT CHIPS presentation.

Let's start with their goals, which were:

Retrofit existing hardware-2 vehicles
Under 100W system power (it has to go behind the glove-box)
Low-enough parts cost to enable redundancy architectures
Focus exclusively on Tesla's requirements
Safety and security
Minimize software migration costs

That led to goals for the FSD chip itself:

>50 TOPS of neural network performance
High utilization (80%)
<40W per chip (because there will be two of them)
GPUs and CPUs for post-processing and general-purpose needs
Security and safety needs
Modular, to enable various platform redundant uses

The new board contains dual redundant identical SoCs (in blue and turquoise in the picture). There are also dual redundant power supplies. In the Q&A, they said that the two chips could either be run redundantly or separately. They were designed for both modes of operation.

The form factor of the board and the connectors are backward compatible with the previous Autopilot computer. There are overlapping camera fields with redundant paths, as you can see on the right of the image.

The computer takes in information from sensors:

Cameras
Wheel ticks
Radar
GPS
Maps
Inertial measurement unit (IMU)
Steering angle

The Chip

Here's the chip. It is manufactured in Austin in Samsung's 14nm FinFET process. It is 260mm² with 6 billion transistors. It is already in production. In fact, on Autonomy Day, Tesla announced that it had already been shipping in several models for a few weeks. It is packaged in a 37.5mm square flip-chip BGA.

The chip contains a Tesla-designed neural network accelerator, along with third-party IP for the CPU, GPU, ISP, H.265 video encoder, memory controller, PHYs, on-chip NoC, and peripherals.

The chip contains two copies of the neural network accelerator, each running at 2+GHz. Each one has a 96x96 array of MACs, giving 36.8 TOPS per accelerator, along with hardware SIMD, ReLU, and pooling units. There is 32MB of SRAM per instance. The MAC array is all 8-bit precision, with no floating point.
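The TOPS number can be checked on the back of an envelope. This is my own arithmetic, not something from the slides: I am assuming each MAC counts as two operations (a multiply and an add) per cycle, and I am taking the "2+GHz" clock as exactly 2.0GHz.

```python
# Back-of-the-envelope check of the per-accelerator throughput figure.
MAC_ROWS = 96
MAC_COLS = 96
CLOCK_HZ = 2.0e9           # "2+GHz" in the talk; exactly 2.0 GHz is assumed here
OPS_PER_MAC = 2            # assumption: one multiply plus one add per MAC per cycle
ACCELERATORS_PER_CHIP = 2


def peak_tops(rows, cols, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak tera-operations per second for a rows x cols MAC array."""
    return rows * cols * ops_per_mac * clock_hz / 1e12


per_accelerator = peak_tops(MAC_ROWS, MAC_COLS, CLOCK_HZ)
per_chip = per_accelerator * ACCELERATORS_PER_CHIP

print(f"{per_accelerator:.3f} TOPS per accelerator")
print(f"{per_chip:.3f} TOPS per chip")
```

Under those assumptions the array comes out at 36.864 TOPS per accelerator, which matches the 36.8 quoted, and about 73.7 TOPS for the two accelerators on a chip, comfortably above the >50 TOPS goal.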

A single convolution is a 7-deep nested for-loop. It turns out that 99.7% of operations are MACs. However, speeding up just the MACs by orders of magnitude makes quantization and pooling more performance-sensitive (a sort of Amdahl's law for neural networks). So they have dedicated quantization and pooling hardware, too, to speed things up all around.
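To make the "7-deep nested for-loop" concrete, here is a minimal sketch of a naive convolution in plain Python, followed by the Amdahl's law arithmetic. The loop order, the names, and the stride-1, no-padding choice are mine, not Tesla's. The speedup calculation also assumes that the 99.7% share of operations translates directly into a 99.7% share of run time, which is a simplification.

```python
def conv2d_naive(inputs, weights):
    """Naive convolution, stride 1, no padding.

    inputs[n][c][y][x]  : batch, input channel, row, column
    weights[k][c][r][s] : output channel, input channel, kernel row, kernel column
    Returns (outputs[n][k][y][x], number of multiply-accumulates performed).
    """
    batch, in_ch = len(inputs), len(inputs[0])
    in_h, in_w = len(inputs[0][0]), len(inputs[0][0][0])
    out_ch = len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1

    outputs = [[[[0] * out_w for _ in range(out_h)]
                for _ in range(out_ch)] for _ in range(batch)]
    macs = 0
    for n in range(batch):                          # 1: image in the batch
        for k in range(out_ch):                     # 2: output channel
            for y in range(out_h):                  # 3: output row
                for x in range(out_w):              # 4: output column
                    acc = 0
                    for c in range(in_ch):          # 5: input channel
                        for r in range(k_h):        # 6: kernel row
                            for s in range(k_w):    # 7: kernel column
                                acc += inputs[n][c][y + r][x + s] * weights[k][c][r][s]
                                macs += 1
                    outputs[n][k][y][x] = acc
    return outputs, macs


def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the work gets `factor` faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# One 3x3 single-channel image, one 2x2 kernel of ones.
image = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
kernel = [[[[1, 1], [1, 1]]]]
result, mac_count = conv2d_naive(image, kernel)
print(result[0][0], mac_count)

# MACs are 99.7% of the work; suppose only they get 1000x faster.
print(round(amdahl_speedup(0.997, 1000.0), 1))
```

With a hypothetical 1000X speedup on the MACs alone, the overall gain works out to only about 250X, because the remaining 0.3% now accounts for most of the time. That is the argument for giving quantization and pooling their own hardware.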

To keep power down, the control logic is a flexible state machine with built-in loop constructs. This eliminates DRAM reads and writes, and minimizes SRAM reads. It is a single clock domain with DVFS-enabled power and clock distribution.

The instruction set is shown in the above diagram. Note that this is not a sample of instructions; it is the complete instruction set. It provides just what is required, but with the flexibility to change the algorithms. There is limited out-of-order execution, since DMA read, DMA write, and compute can take place simultaneously. There is no data movement in the grid (it is not a systolic array).

The microarchitecture is shown in the above diagram.

The programmable SIMD unit is shown. It has signed and unsigned int and FP32 arithmetic, with predication support for all instructions. There is a pipelined implementation of quantization that fuses ReLU, scale, and normalization layers. There is full SIMD program support with functions such as Argmax, Exponential, Sigmoid, and Tanh.
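The presentation did not go into how the fused quantization step works, so the following is only a generic sketch of the idea: the scale and normalization are folded into a single multiply-and-add, ReLU is applied, and the result is clamped back to 8 bits. The function name, the `scale` and `bias` parameters, the rounding rule, and the unsigned 8-bit output are all my assumptions for illustration, not details of Tesla's hardware.

```python
def fused_requantize(acc, scale, bias, out_bits=8):
    """Hypothetical fused post-processing of one wide accumulator value.

    acc   : integer accumulator coming out of the MAC array
    scale : combined scale factor (quantization scale with normalization folded in)
    bias  : combined offset (normalization shift folded in)
    Returns an unsigned integer that fits in `out_bits` bits.
    """
    y = acc * scale + bias          # scale and normalization as one affine step
    y = max(y, 0.0)                 # ReLU
    q = int(y + 0.5)                # round half up (safe because y is non-negative)
    upper = (1 << out_bits) - 1     # 255 for 8 bits
    return min(q, upper)            # saturate rather than wrap around


for accumulator in (1000, -500, 100000):
    print(accumulator, fused_requantize(accumulator, scale=0.01, bias=0.4))
```

Doing all of this in one pass means the wide accumulator value never has to be written out and read back between layers, which is the point of fusing them.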

Results

The result is that the performance is 21X what they achieved on the previous HW 2.5 generation, at a 1.25X increase in power to 72W (for two of them), and at 80% of the cost of the prior generation. Only 20% of the power (15W) is in the neural network accelerator.
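It is worth running the arithmetic on those ratios. This is my own calculation from the numbers above, assuming that the 15W is the combined draw of the neural network accelerators within the 72W total.

```python
# Figures quoted in the talk.
PERF_RATIO = 21.0        # performance versus HW 2.5
POWER_RATIO = 1.25       # power versus HW 2.5
SYSTEM_POWER_W = 72.0    # new computer, two chips
NNA_POWER_W = 15.0       # assumption: total for the neural network accelerators

perf_per_watt_gain = PERF_RATIO / POWER_RATIO
implied_prior_power_w = SYSTEM_POWER_W / POWER_RATIO
nna_share = NNA_POWER_W / SYSTEM_POWER_W

print(f"performance per watt gain: {perf_per_watt_gain:.1f}X")
print(f"implied HW 2.5 power: {implied_prior_power_w:.1f}W")
print(f"accelerator share of power: {nna_share:.1%}")
```

So the improvement in performance per watt is 16.8X, the implied power of the previous generation is 57.6W, and 15W out of 72W is about 20.8%, consistent with the "20%" quoted.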

Their summary was that:

This was a completely optimized SoC from scratch (14 months from architecture to tapeout)
Outstanding performance per Watt for Tesla's neural networks with the neural network accelerator
Enables full redundancy at optimal cost
The FSD computer will help enable new safety and autonomy levels in the future
You can own one today—they are shipping in all new Teslas

