Paul McLellan

Inside Google's TPU

19 Nov 2018 • 6 minute read

At the Linley Fall Processor Conference recently, the keynote on the second day was by Cliff Young of Google, titled Codesign in Google TPUs: Inference, Training, Performance, and Scalability. The TPU is Google's Tensor Processing Unit. It is actually a family, since, as Cliff detailed, it is now on its third version. One of the things that makes Google's TPU so interesting compared to other designs is that it is deployed at scale in Google's datacenters. When you use Android or Google Home and say "OK Google" or "Hey Google", the voice processing is done by TPUs in a production environment. One of the motivations for Google developing the TPU in the first place was apparently a calculation that they would need dozens more datacenters if they did all the anticipated voice processing on regular datacenter servers.

I'm going to take two posts to cover the TPU. Today will cover the hardware story. Tomorrow will cover the software and benchmarking. Of course, the two are intertwined, so don't complain if software is mentioned today or hardware tomorrow.

Cliff started by pointing out that every aspect of Google's business, including hardware, is being influenced by AI. There are three generations of the TPU (so far).

As Cliff said:

We are building our own chips but we are building them at scale. That’s a Google thing, we believe in warehouse computing.

TPU v1

TPU v1 has been in datacenters for three years now, since 2015. Google disclosed it a year later, in 2016, at Google I/O, after AlphaGo had beaten the world Go champion. The initial chip only did inference, to address the potential disaster of speech processing overwhelming even Google's datacenter capacity. It used 8-bit arithmetic, with a fallback to 16-bit if that turned out not to be enough. There is no floating point. The original paper talked about 6 product applications, but there are way more than that now. It is also powering research inside Google, so it's not just the speech stuff that is most visible.
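To make the 8-bit inference arithmetic concrete, here is a minimal sketch of a quantized matrix multiply, my own illustration rather than anything Google has published: weights and activations are stored as int8, products accumulate into wider integers, and a per-tensor scale recovers an approximation of the real-valued result.

```python
import numpy as np

# Minimal sketch of 8-bit quantized inference (illustrative, not TPU code).
def quantize(x, scale):
    """Map float values to int8 using a simple symmetric scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_matmul(a_q, w_q, a_scale, w_scale):
    """Multiply int8 matrices, accumulating in int32, then dequantize."""
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * w_scale)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)   # activations
w = rng.standard_normal((8, 3)).astype(np.float32)   # weights
a_scale, w_scale = np.abs(a).max() / 127, np.abs(w).max() / 127

approx = int8_matmul(quantize(a, a_scale), quantize(w, w_scale), a_scale, w_scale)
print(np.max(np.abs(approx - a @ w)))   # small quantization error vs the float result
```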

ASIC designs like the TPU sit on a spectrum of programmability. At one end you can license Arm and build a sea of cores; at the other you can build a fixed-function block. Google picked a place in the middle for the TPU, with what they thought was a good balance between programmability and performance. In particular, in a field where the ground is shifting almost daily, there needs to be the right amount of flexibility to handle future models, about which little is known. TPU v1 was designed in 15 months, and it delivers 15-30X the performance of contemporary CPUs, and 30-80X the performance per watt of contemporary CPUs and GPUs.

TPU v2

The TPU v2 is generally available, in the sense that you can rent time on it through Google Cloud. It has 180 teraflops of compute, 64 GB of HBM, and 2400 GB/s of memory bandwidth. It is designed to be connected together into larger configurations.
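A back-of-the-envelope calculation from those numbers (my own arithmetic, not something Cliff presented) shows how much work each byte fetched from HBM has to support before the chip is compute-bound rather than memory-bound, which is part of why dense matrix multiply, with its high data reuse, is such a good fit:

```python
# Rough arithmetic from the figures quoted above (illustrative only).
peak_flops = 180e12       # 180 teraflops per Cloud TPU v2 unit
hbm_bandwidth = 2400e9    # 2400 GB/s of HBM bandwidth

# FLOPs that must be performed per byte fetched to keep the compute units busy.
print(peak_flops / hbm_bandwidth, "FLOPs per byte")   # 75.0 FLOPs per byte
```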

It is also designed for both training and inference. It still has 8-bit arithmetic but adds floating point. Whereas TPU v1 was designed like an old-style GPU to be a co-processor to a host CPU, v2 is designed to be networked (although still connected to a host server).

They kept the matrix multiply unit from v1 but also added a general-purpose scalar unit and a general-purpose vector unit. Cliff said that in some ways it is like an old Cray supercomputer. Cray built the best-performing scalar units on the planet because, due to Amdahl's Law, a major limit on overall performance is the part of the code that cannot be vectorized.
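That Amdahl's Law limit is easy to quantify with a small, hypothetical example (my numbers, not Cliff's): however fast the vector and matrix units get, the scalar fraction of the work caps the overall speedup.

```python
def amdahl_speedup(accelerated_fraction, accelerator_speedup):
    """Overall speedup when only part of the work benefits from acceleration."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / accelerator_speedup)

# Hypothetical workload: 95% of the time is spent in vectorizable code.
for s in (10, 100, 1e9):
    print(f"accelerator {s:>12g}x faster -> overall {amdahl_speedup(0.95, s):.1f}x")
# Even an effectively infinite accelerator leaves the overall speedup capped at
# ~20x by the 5% of scalar code, which is why fast scalar units still matter.
```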

The v2 chip has 2 cores, with 22.5 TFLOPS per core, and there are 4 chips per 180 TFLOP Cloud TPU unit. Mostly it uses 32-bit floating point, but it also supports bfloat16 (the "b" stands for "brain"). Normal IEEE fp32 has an 8-bit exponent and a 23-bit mantissa. IEEE fp16 half-precision has just a 5-bit exponent and a 10-bit mantissa. It was designed with graphics in mind, and is good for high-definition rendering, but for machine learning it is the wrong tradeoff, since training involves accumulating many tiny values and needs the wider dynamic range. So Google invented a new floating-point representation, bfloat16, which has the 8-bit exponent of fp32 but just 7 bits of mantissa. In the MACs, the multiplies can be bfloat16 while the additions are done in fp32. In the Q&A, it was pointed out that Intel has committed to supporting bfloat16.
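To make the format concrete, here is a sketch of my own (not TPU code) that simulates bfloat16 by truncating fp32 values: because the 8-bit exponent is untouched, the dynamic range of fp32 survives, while the mantissa drops to 7 bits.

```python
import numpy as np

def truncate_to_bfloat16(x):
    """Simulate bfloat16 by zeroing the low 16 bits of each fp32 value.

    fp32:     1 sign bit | 8 exponent bits | 23 mantissa bits
    bfloat16: 1 sign bit | 8 exponent bits |  7 mantissa bits
    The exponent field is unchanged, so range is preserved; only precision is lost.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([1.2345678, 3e38], dtype=np.float32)
print(truncate_to_bfloat16(x))  # ~1.234375 (7-bit mantissa) and a still-finite huge value
```

For comparison, fp16's 5-bit exponent tops out around 6.5×10^4, so a value like 3e38 would simply overflow to infinity.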

The cloud version of v2 has been available since last year; the TPU v2 pod, which builds the chips into supercomputer configurations, is now in alpha. The TPUs are in the blue section in the middle of the above picture. A pod has 11.5 petaflops, 4 TB of HBM, and a 2-D toroidal mesh network.
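For a sense of what a 2-D toroidal mesh means, here is a tiny generic sketch (my own illustration, not how the TPU interconnect is actually addressed): every node has four neighbors and the edges wrap around, so there is no boundary and worst-case hop counts stay modest.

```python
def torus_neighbors(x, y, width, height):
    """Four neighbors of node (x, y) on a 2-D torus; edges wrap around."""
    return [((x - 1) % width, y), ((x + 1) % width, y),
            (x, (y - 1) % height), (x, (y + 1) % height)]

# On a hypothetical 16x16 torus, a corner node wraps to the opposite edges.
print(torus_neighbors(0, 0, 16, 16))   # [(15, 0), (1, 0), (0, 15), (0, 1)]
```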

As I said above, the TPU v2 was designed for training as well as inference. Training is about 3X the computation of inference: forward propagation (as in inference), back-propagation, and weight update. There are also much longer data storage lifetimes, which put more pressure on memory capacity and bandwidth. There are huge training datasets, and more regular changes to algorithms and model structure, which require even more flexibility.
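To see where the roughly 3X comes from, here is a toy single-layer training step in plain numpy (a generic illustration, not TPU code): inference is one matrix multiply, while back-propagation adds two more of comparable size, plus a cheap weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64)).astype(np.float32)       # batch of activations
w = rng.standard_normal((64, 10)).astype(np.float32)       # layer weights
y_true = rng.standard_normal((32, 10)).astype(np.float32)  # targets
lr = np.float32(0.01)

# Inference: just the forward pass (one matmul for this toy layer).
y = x @ w

# Training adds back-propagation (two more matmuls) and a weight update.
grad_y = 2 * (y - y_true) / np.float32(y.shape[0])  # gradient of a mean-squared-error loss
grad_w = x.T @ grad_y        # gradient for the weights
grad_x = grad_y @ w.T        # gradient passed back to the previous layer
w -= lr * grad_w             # weight update (cheap, elementwise)

# Roughly three matmuls per layer instead of one, and the activations x must be
# kept around until the backward pass, which stresses memory capacity and bandwidth.
```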

TPU v3

Cliff could only say "a few things" about TPU v3. It is liquid cooled, as you can see from the picture revealed at this year's Google I/O, and the pod size is increased to 8 racks. It is currently in beta. In the Q&A later, Cliff was asked whether the connections were optical. His carefully worded response was:

I don’t think we’ve disclosed that so I’m not going to say. But if you looked at the box at Google I/O I can say they were electrical connectors. How’s that?

The performance is 420 teraflops per unit, with 128 GB of HBM. The v3 pod has over 100 petaflops and 32 TB of HBM.

In the Q&A someone asked how important it was to scale beyond 256 nodes. Cliff said:

We already have. TPU v3 is more, I think we said 1024. We do look at different network topologies, all the stuff in supercomputer design, and even datacenter design. But demand from the brain side is “a teraweight machine”. There is a collision between how many pods we can fit in a datacenter and how many datacenters can we buy.

Edge TPUs

Google has announced their Edge TPUs. However, they haven't announced the specs yet, and Cliff wasn't talking. They are the tiny chips on the cent coin in the photo below on the right.

On their website, all Google says is:

Edge TPU is Google’s purpose-built ASIC designed to run AI at the edge. It delivers high performance in a small physical and power footprint, enabling the deployment of high-accuracy AI at the edge.

Summary

Google have been making "relentless progress":

  • TPU v1, deployed 2015, 92 teraops, inference only.
  • TPU v2, cloud TPU 2017, pod 2018, 180 teraflops, 64 GB HBM, training and inference, generally available. 11.5 petaflops in a pod.
  • TPU v3, cloud beta 2018, 420 teraflops, 128 GB HBM, training and inference, beta. >100 petaflops in a pod.


Part 2 tomorrow.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.