
Breakfast Bytes Blogs

Paul McLellan
22 Aug 2019


HOT CHIPS: The Biggest Chip in the World

At HOT CHIPS earlier this week, Cerebras announced the biggest chip in the world. Sean Lie spent the first half of his presentation talking about how deep learning algorithms get more effective with more processors, how you need to handle sparsity, and how you should be able to program them with TensorFlow and all the other familiar environments. I'm just going to take that as given. I've written plenty about this, and there are lots of specialized deep learning processors available with similar basic capabilities. Cadence has the Tensilica DNA 100 Processor (see my post The New Tensilica DNA 100 Deep Neural-Network Accelerator for details), but anyone serious about deep learning, especially for inference at the edge, has something along the same lines. The Cerebras processor, however, is scaled in a very different way.

Wafer Scale Engine

More processors are good. So how about 400,000 of them? That takes up a lot of space, so how about a whole wafer? All the numbers are mind-bogglingly enormous: 46,225mm² of die area (over 8" on a side), 1.2 trillion transistors (that is not a typo, it really is over a trillion), 18GB of on-chip memory, 100PB/s of memory bandwidth, and 100Pb/s of on-chip communication bandwidth. It is manufactured in TSMC's 16nm process, so they are only jogging, not sprinting, with 7nm and beyond available for the future.
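A quick sanity check on those headline numbers. The only figure below that is my own assumption is the 815mm² GPU die area, which is roughly the reticle-limit GPU of the day:

```python
import math

# Published WSE figures from the HOT CHIPS talk.
die_area_mm2 = 46_225               # total die area, mm^2

# A square die of that area is 215mm on a side, i.e. about 8.5
# inches, consistent with "over 8 inches on a side".
side_mm = math.sqrt(die_area_mm2)
side_in = side_mm / 25.4
print(f"{side_mm:.0f} mm ({side_in:.2f} in) per side")

# Assumption: the largest GPU dies of the day were about 815 mm^2,
# right at the reticle limit. That reproduces the quoted ~56x ratio.
gpu_mm2 = 815
print(f"~{die_area_mm2 / gpu_mm2:.1f}x a reticle-limit GPU die")
```

The arithmetic matches the talk: 215mm per side and roughly 56 times a reticle-limited die.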

As Cliff Young, the session chair, said:

This is perhaps the hottest chip that we've ever presented at HOT CHIPS.

This is the largest square that you can get out of a 300mm wafer, and it is 56 times larger than a maximum reticle-sized GPU. In the Q&A, Sean was asked why they didn't make it round and use the whole wafer, and he said it was mainly convenience, although it was something they would consider in the future.

The second half of their presentation was about the challenges of doing this. How do you manufacture it? How do you handle inevitable defects? How do you get power to it, since it's too much for normal PCB power planes? How do you package it? How do you cool it?

Challenge #1: Cross Die Connectivity

When a wafer is manufactured, lithography is done one die at a time. The pattern for a layer is exposed at one die position, then the wafer is moved inside the stepper to the next die position and exposed again. After the entire wafer has been manufactured, a diamond saw cuts it up into individual die. Since the saw blade has a width, some space has to be left for it, and this is known as the scribe line. The name dates from the 1970s, and maybe even the 1980s, when diamond saws were not used: instead, the wafer was scribed (scratched) and simply broken up. In the same way that a bar of chocolate breaks along its lines, the wafer would break along the scribe lines and separate into individual die.

Another complication is that process control monitor (PCM) structures are put in the scribe line to check on the manufacturing process. These are not needed after manufacture, so they get sacrificed when the diamond saw runs up the scribe line. Since each die is normally separate, no signals run across the scribe line from one die to the next.

To manufacture this chip, Cerebras worked with TSMC so that they could run wires from one die to the next, extending the 2D mesh across the whole wafer. The connectivity across the scribe line is the same as between processors on each die, creating a homogeneous array. The wires are very short (<1mm) and so support very high bandwidth. Sean didn't say much about it, but the wires across the scribe line presumably had looser design rules than the wires between cores on the same die. Although the overlay tolerance of one layer to the next on an individual die is very tight, perhaps 2nm, this is achieved by the way the stepper aligns onto special patterns on each die known as alignment marks or fiducials. The tolerance from one die to the next is much coarser, since normally the only effect of misalignment is to vary the width of the scribe line, which is destroyed by the diamond saw anyway. The Cerebras reticles must have had metal lines on the left and right (and top and bottom) that joined up when the scanner did its step and repeat. There probably also needed to be some adjustments to the PCMs in the scribe line to ensure they didn't interfere with the interconnect. I'm assuming that nothing crossed the scribe line in the FEOL, which would mean the PCMs could remain normal during transistor manufacture.
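The effect of extending the mesh across the scribe lines can be sketched with a toy model (the die and core counts here are made up for illustration, not Cerebras's actual dimensions): a link between neighbouring cores looks exactly the same whether or not it happens to cross a die boundary.

```python
DIES = 3        # dies per side of the (square) array -- illustrative
CORES = 4       # cores per side of each die -- illustrative
W = DIES * CORES  # width of the unified core array

def neighbours(x, y):
    """4-connected mesh neighbours of core (x, y), clipped at the array edge."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < W and 0 <= ny < W:
            yield nx, ny

def crosses_scribe(a, b):
    """True if the link between cores a and b spans a die boundary."""
    return (a[0] // CORES, a[1] // CORES) != (b[0] // CORES, b[1] // CORES)

# Build the set of undirected mesh links over the whole wafer.
links = set()
for x in range(W):
    for y in range(W):
        for n in neighbours(x, y):
            links.add(frozenset([(x, y), n]))

scribe_links = {l for l in links if crosses_scribe(*sorted(l))}
print(len(links), "mesh links,", len(scribe_links), "cross a scribe line")
```

The scribe-crossing links are a small fraction of the total, and nothing in the mesh's structure distinguishes them, which is the point of the homogeneous array.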

Challenge #2: Defects

There will be some defects even in a mature process, meaning that some of the cores will not work correctly. Cerebras added redundant processors to the array, along with additional connectivity between the processors, so that non-functioning processors could be locally bypassed. After configuration, the fact that this had happened would be invisible to the software, which would see a homogeneous array of identical processors across the whole chip/wafer.

In the diagram, the black processor core is non-functional. One of the very pale redundant processors at the top is called into action, and the routing between the processors is adjusted to pull it into service, with the routing through the intermediate cores also adjusted to preserve the homogeneity.
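The remapping idea can be sketched in a few lines. This is my own illustrative scheme, not Cerebras's actual repair mechanism: a spare row at the top absorbs one defect per column by shifting the logical-to-physical mapping of that column past the dead core.

```python
ROWS, COLS = 5, 4      # physical array; the top row is spare (illustrative sizes)
defective = (2, 1)     # physical (row, col) of the core that failed

def logical_to_physical(r, c):
    """Map a logical core in the (ROWS-1 x COLS) visible array to a working
    physical core, skipping over the defect in its column."""
    if c == defective[1] and r >= defective[0]:
        return (r + 1, c)   # shift the rest of this column into the spare row
    return (r, c)

# Software sees a dense (ROWS-1 x COLS) array; every logical core lands
# on a distinct, working physical core.
mapping = {(r, c): logical_to_physical(r, c)
           for r in range(ROWS - 1) for c in range(COLS)}
assert defective not in mapping.values()
assert len(set(mapping.values())) == len(mapping)
print("logical column 1 ->", [mapping[(r, 1)] for r in range(ROWS - 1)])
```

After the remap, the defective core simply has no logical address, which is what "invisible to the software" means in practice.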

Challenge #3: Thermal Expansion

If the die were mounted directly on the board, the difference in thermal expansion between the chip and the board would crack the die, which is obviously not a good thing. Instead, Cerebras invented a custom connector to mount the wafer on the PCB, maintaining electrical connectivity while absorbing the variation.
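Some rough arithmetic shows why this matters at wafer scale. The expansion coefficients below are textbook values for silicon and FR-4; the temperature swing is my assumption:

```python
# Coefficients of thermal expansion (textbook values).
cte_si = 2.6e-6        # silicon, 1/K
cte_fr4 = 17e-6        # typical FR-4 PCB material, 1/K

side_mm = 215          # die edge length (sqrt of 46,225 mm^2)
delta_t = 50           # assumed operating temperature swing, K

# Differential expansion across one edge of the die.
mismatch_mm = (cte_fr4 - cte_si) * side_mm * delta_t
print(f"{mismatch_mm * 1000:.0f} um of differential expansion across one edge")
```

Roughly 150µm of relative movement across one edge is far more than rigid solder joints can survive, hence the compliant custom connector.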

Challenge #4: Package Assembly

No equipment exists to handle the packaging. The printed circuit board, the custom connector, the chip itself, and the cold plate on top, all need to be precisely aligned. So Cerebras developed custom machines for handling this to ensure precision alignment.

Challenge #5: Power and Cooling

The next challenge was getting power into the chip. The normal way to do this is with power planes in the PCB, but these don't have enough current-carrying capacity to deliver power across the whole chip; only the edge of the chip would have been properly powered. Air cooling would also be inadequate to remove the heat generated by the chip.

So power was delivered through the circuit board from a special layer underneath, and the cooling was solved by using water cooling, with water flowing through channels in the cold plate.
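Some back-of-the-envelope arithmetic makes the power-plane problem concrete. The power and supply-voltage figures below are my illustrative assumptions; the talk did not disclose them:

```python
power_w = 15_000       # assumed total power budget, watts
vdd = 0.8              # assumed core supply voltage, volts

# Total supply current at that voltage.
current_a = power_w / vdd
print(f"{current_a:,.0f} A of supply current")

# At that current, every micro-ohm of lumped distribution resistance
# dissipates I^2 * R of loss, so the effective path resistance has to
# be vanishingly small. Feeding power vertically through the board,
# directly under the die, keeps the current paths short.
loss_per_uohm_w = current_a ** 2 * 1e-6
print(f"{loss_per_uohm_w:,.1f} W lost per micro-ohm of path resistance")
```

Under these assumptions the chip draws nearly 19,000A, and even a micro-ohm of distribution resistance costs hundreds of watts, which is why edge-fed PCB power planes are a non-starter.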

It's Working

Sean wrapped up by telling us that it's working and it's running customer workloads. As he said:

We built the largest chip in the world; it gets cluster-scale performance to solve today's hardest deep learning problems.

In the Q&A, he was asked what that performance was, but said they were not disclosing that yet. They would be talking more about the overall system performance in the coming months. "Stay tuned."

Of course, he had brought one with him to show us.


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.

Tags:
  • artificial intelligence
  • deep learning
  • cerebras
  • wafer scale integration
  • TSMC
  • hot chips
  • AI