Today, I'm covering two of the biggest chips presented at the recent HOT CHIPS: Ponte Vecchio, Intel's enormous graphics chip intended for Argonne National Laboratory's supercomputer Aurora, and the latest developments from Cerebras (not to be confused with Cerebrus, Cadence's AI-enabled digital design environment).
Intel went into a fair bit of detail about the architecture of the core and so forth. Officially it is the HPC version of Intel's Xe GPU architecture. But in many ways, the most interesting thing about Ponte Vecchio is the scale and the manufacturing challenges.
By the way, Ponte Vecchio means "old bridge" and Ponte Vecchio is indeed the oldest bridge in Florence. Trivia question of the day: what is the oldest bridge across the Seine in Paris? Answer: Pont Neuf, which means "new bridge". Of course, there used to be older bridges but they were all replaced.
So here is how it is put together. There are 47 tiles (chiplets) manufactured on five different process nodes, for a total of over 100B transistors. The compute tiles are built on TSMC's N5 technology. The base tile is built on Intel 7. The link tile is built on TSMC N7. The HBM tile is obviously built in someone's memory technology. And I think something was built in an older Intel technology, but it's not on the slides.
Foveros is Intel's proprietary 3D technology for stacking die on top of each other. EMIB (which stands for embedded multi-die interconnect bridge) is Intel's 2.5D technology for connecting die alongside each other without the need for TSVs, using a small bridge die embedded in the package substrate.
Cerebras is the company that built a wafer-scale chip. It was originally announced at HOT CHIPS a couple of years ago, and you can read what I said about it back then in my post HOT CHIPS: The Biggest Chip in the World. That version of the chip was in 16nm. Since then they have done a 7nm version, obviously with even more dramatic numbers:
That "chip" goes into a system called CS-2, which I wrote about in Linley: Habana and Cerebras. I wrote about its predecessor (with the 16nm version of the chip) in Weekend Update 2.
At this year's HOT CHIPS, Cerebras presented how it is taking this up to "brain scale" with up to 192 CS-2s ganged together with a new memory unit, disaggregating memory and compute. The basic idea is to store all the weights in the memory and stream them through the CS-2s.
The new memory unit is called MemoryX. It allows a single CS-2 to work on models with up to 120 trillion parameters. Capacity scales from 4TB to 2.4PB, enough to hold 200 billion to 120 trillion weights along with some optimizer state. It is built on a mixture of DRAM and flash storage, and there is also some internal compute for the weight update/optimizer step.
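Those capacity and weight-count figures are internally consistent at roughly 20 bytes per weight at both ends of the range, which plausibly covers the weight itself plus a few words of optimizer state. A quick back-of-the-envelope check (my own arithmetic, assuming SI units, i.e. 1TB = 1e12 bytes):

```python
# Sanity-check the MemoryX numbers: bytes available per weight
# at both ends of the stated range (assumes SI units).
min_capacity = 4e12      # 4 TB
max_capacity = 2.4e15    # 2.4 PB
min_weights = 200e9      # 200 billion weights
max_weights = 120e12     # 120 trillion weights

low_end = min_capacity / min_weights    # bytes per weight, small config
high_end = max_capacity / max_weights   # bytes per weight, large config
print(low_end, high_end)                # both work out to 20 bytes/weight
```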
It has intelligent pipelining to mask latency. During the presentation, Cerebras went into some detail on how this works, but it is too detailed for this post.
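The essence of latency masking is double buffering: while the compute engine works on one layer's weights, the next layer's weights are already in flight from external memory. Here is a minimal sketch of that idea in Python (my own illustration with placeholder `fetch` and `compute` functions, not Cerebras's actual mechanism):

```python
import queue
import threading

def stream_weights(layers, fetch, compute, depth=2):
    """Prefetch up to `depth` layers of weights ahead of the compute engine,
    so fetch latency overlaps with computation instead of adding to it."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        for layer in layers:
            buf.put(fetch(layer))  # blocks once `depth` layers are in flight
        buf.put(None)              # sentinel: no more layers

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (weights := buf.get()) is not None:
        results.append(compute(weights))  # overlaps with the next fetch
    return results
```

With `depth=2` the fetch of layer i+1 proceeds concurrently with the compute on layer i, which is the pipelining the presentation described in much more detail.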
But it scales out further. A SwarmX interconnect unit between the memory and the CS-2 boxes allows it to scale almost linearly up to 192 CS-2s.
The wafers in the CS-2s are the matrix-multiplier arrays. You probably know that neural networks are mostly about matrix multiplication. The scale of CS-2 allows it to go up to 100K by 100K matrices. The weights are never stored on the CS-2s; they are streamed through from the memory and, in the case of multiple CS-2s, broadcast through the SwarmX simultaneously to all of them.
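To make the weight-streaming idea concrete, here is a toy NumPy sketch (mine, not Cerebras's dataflow): the activations stay resident while each layer's weight matrix arrives from a stream, is used once, and is discarded.

```python
import numpy as np

def weight_streaming_forward(x, weight_stream):
    """Forward pass where activations stay resident on the compute fabric
    and weights arrive one layer at a time from external memory."""
    for w in weight_stream:          # each w is one layer's weight matrix
        x = np.maximum(x @ w, 0.0)   # matmul + ReLU; w is then discarded
    return x

# Tiny illustration: three layers streamed through a 4-wide activation.
rng = np.random.default_rng(0)
layers = (rng.standard_normal((4, 4)) for _ in range(3))  # generator = "stream"
y = weight_streaming_forward(np.ones((1, 4)), layers)
print(y.shape)  # (1, 4)
```

The generator stands in for MemoryX: no weight matrix exists on the "compute" side for longer than one layer's worth of work.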
You won't be surprised to hear that they haven't actually built a real system with 192 CS-2s. But the graph above shows the projections, showing near-linear scaling. The ones that seem to fall off are not the biggest models but the smallest. There is just not enough computation going on with models with only 10B weights, so overhead starts to be significant.
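The falloff for small models follows from a simple Amdahl-style argument: divide the divisible work by the number of systems, but keep a fixed per-step overhead (communication, weight broadcast, and so on). This toy model is my own, with made-up numbers chosen only to show the shape of the curve:

```python
def projected_speedup(model_flops, n_systems, overhead_flops=1e12):
    """Toy Amdahl-style model: per-step time is the divisible work
    divided across systems, plus a fixed overhead (in equivalent FLOPs)."""
    t1 = model_flops + overhead_flops
    tn = model_flops / n_systems + overhead_flops
    return t1 / tn

print(projected_speedup(1e16, 192))  # large model: near-linear
print(projected_speedup(1e13, 192))  # small model: overhead dominates
```

With lots of work per step the fixed overhead is negligible and scaling stays near 192x; with a small model the overhead term dominates and the speedup collapses, which is exactly the behavior in Cerebras's projections.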
This weight streaming model makes programming straightforward since a workload is mapped onto multiple CS-2s just as if it were a single bigger system.
So the final value proposition is:
They think they can train GPT-3 in a day, or a 1T parameter model over a weekend.
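For a sense of what "GPT-3 in a day" implies, here is my own back-of-the-envelope calculation using the commonly cited estimate of roughly 3.14e23 FLOPs to train GPT-3 (this estimate is not from the Cerebras presentation):

```python
# Rough sustained throughput needed to train GPT-3 in 24 hours,
# assuming the commonly cited ~3.14e23 total training FLOPs.
gpt3_train_flops = 3.14e23
seconds_per_day = 24 * 3600

required = gpt3_train_flops / seconds_per_day
print(f"{required:.2e} FLOP/s sustained")  # on the order of exaFLOP/s
```

That works out to a few exaFLOP/s sustained across the cluster, which gives a feel for the scale being claimed.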
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.