
Author

Paul McLellan

Community Member

Tags: chiplets, hot chips, AI, hotchips2022

HOT CHIPS Day 2: AI...and More Hot Chiplets

12 Sep 2022 • 7 minute read

My post about the first day of HOT CHIPS appeared yesterday. See HOT CHIPS Day 1: Hot Chiplets.

The second day seemed almost like Tesla day, with two presentations on various aspects of DOJO and the day's keynote, Beyond Compute - Enabling AI through System Integration, by Tesla's Ganesh Venkataramanan. I'll cover that keynote in its own post. NVIDIA also showed up three times on the second day.

The focus of a lot of the day was on AI and Deep Learning. As on the first day, almost every chip was either directly for AI or was a general purpose processor with some sort of AI acceleration. And, once again, a lot of the systems were constructed using chiplets and systems-in-package (or even more extreme forms of packaging for DOJO and Cerebras).

I'm just going to focus on the scale of these chips, since there is not enough space to describe each of them in detail. Also, a reminder: even though HOT CHIPS is over, you can still register and watch the video replays of the presentations (and download the PDFs).

Groq Software-Defined Scale-out Tensor Streaming Multi-Processor

Dennis Abts presented Groq's multicore mesh architecture on GroqChip 1. There is no chiplet angle to this since the technology is built as a single chip, albeit on a massive scale. A lot of the smarts of the chip are in how the software schedules operations around the vector unit (VXM) in the center of the chip, and explicitly handles all the memory references efficiently.

Boqueria - Next Generation At-Memory Inference Acceleration Device with 1,000+ RISC-V cores

Boqueria is the chip introduced at HOT CHIPS by Robert Beachler of Untether AI. It is a single chip manufactured in 7nm, delivering 2 Petaflops of FP8 at 30 TFLOPs/W.

Yes, you did read that right: Boqueria uses an 8-bit floating-point format (FP8). In fact, there are two variants, distinguished by how the bits are split between exponent and mantissa: FP8r (for range) and FP8p (for precision). In the analysis that Untether AI did, this turned out to be a sweet spot, twice as energy efficient as INT8. The accuracy loss going from INT8 to FP8 is negligible, and it quadruples the throughput.
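The talk didn't spell out the exponent/mantissa splits behind FP8r and FP8p, but a common pairing in FP8 proposals elsewhere is a 5-bit-exponent/2-bit-mantissa format for range and a 4-bit-exponent/3-bit-mantissa format for precision. A minimal sketch of decoding such a value, under those assumptions (IEEE-style bias, no special values):

```python
def decode_fp8(byte: int, exp_bits: int) -> float:
    """Decode a hypothetical FP8 value: 1 sign bit, exp_bits exponent
    bits, (7 - exp_bits) mantissa bits, IEEE-style bias, no inf/NaN."""
    man_bits = 7 - exp_bits
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# The "range" variant trades mantissa bits for a wider exponent, so its
# largest representable magnitude is far bigger than the "precision" one.
largest_range = decode_fp8(0x7F, exp_bits=5)      # E5M2-style
largest_precision = decode_fp8(0x7F, exp_bits=4)  # E4M3-style
```

Both variants occupy one byte, which is where the 4x throughput over FP32 pipelines comes from.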

[Image: Boqueria scaling]

The architecture scales from sub-watt devices all the way up to datacenter scale. And yes, it is available as a chiplet for integrating into systems-in-package.
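The headline numbers above also pin down the chip's power envelope; a quick back-of-the-envelope check on the quoted figures:

```python
# 2 PFLOPs of FP8 at 30 TFLOPs/W implies roughly a 67 W chip.
throughput_tflops = 2_000   # 2 PFLOPs expressed in TFLOPs
efficiency = 30             # TFLOPs per watt, as quoted
power_w = throughput_tflops / efficiency
```

About 67W, which is consistent with a single 7nm die sitting between the sub-watt and datacenter ends of the scaling range.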

By the way, Boqueria is also the famous market in the center of Barcelona. One year when I was at Mobile World Congress, I got up early, went there before it got crowded with tourists, and made a video of all the wonderful food.

DOJO: The Microarchitecture of Tesla’s Exa-Scale Computer

[Image: Tesla DOJO integration]

I wrote about DOJO last year. It was not presented at HOT CHIPS 2021 but at Tesla's AI Day soon after. See my post NOT CHIPS: Tesla's Project Dojo. I'm not sure it does the design justice to describe it as being chiplet-based. There are 354 processors per chip, and then 25 chips are mounted on a wafer-sized interposer with cooling and power supplied vertically. Like Boqueria, it also supports two flavors of 8-bit floating point, although it also supports larger formats all the way up to FP32. Emil Talpes described the instruction set and the implementation in a lot more detail than I can put here.

DOJO - Super-Compute System Scaling for ML Training

[Image: DOJO training]

Next, Bill Chang described how DOJO is used for ML training.

Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning

The Cerebras WSE-2 "chip" is another one that I've written about before. I put chip in quotes since it is actually built out of a whole wafer. See my posts HOT CHIPS: The Biggest Chip in the World and HOT CHIPS: Two Big Beasts. This year, Sean Lie finally revealed some details of the internal architecture of the system. He also had a slide showing some of their customers (see above).

You can see the statistics for the WSE-2 above: 850,000 cores, 2.8 trillion transistors, and 40 GB of on-chip memory.

AMD 400G Adaptive SmartNIC SOC

The next section of the conference was on networks and switches. I'll just list some of the statistics without digging into all the details presented.

[Image: AMD switch statistics]

Juniper’s Express 5: A 28.8Tbps Network Routing ASIC and Variations

Chang-Hong Wu described how the X-chiplet and the F-chiplet are assembled into ASICs 1-7, with ASIC 8 in the future with co-packaged optics.

I don't have space to describe all the ASICs, but here's just one of them as an example: ASIC 2. There is one X-chiplet, one F-chiplet, four HBM stacks, two silicon interposers, and an organic interposer underneath everything.

NVLink-Network Switch - NVIDIA’s Switch Chip for High Communication-Bandwidth SuperPODs

The next section was NVIDIA, starting in the network and switches with NVLink, and then, after the Tesla keynote, coming back with its Orin automotive chip and the Grace CPU (that goes with the Hopper GPU presented on the first day).

[Image: NVLink4 switch]

Alexander Ishii and Ryan Wells described NVLink, in particular the NVLink4 switch chip, built in 4nm with over 25B transistors.

NVIDIA’s Orin System-on-Chip

Next up was the weirdest session of the conference, titled "ADAS and Grace."

[Image: Orin SoC]

Michael Ditty presented Orin, NVIDIA's ADAS chip for automotive applications. There is no chiplet angle to this one. It is a big SoC, as you can see from the above statistics.

NVIDIA’s Grace CPU

Grace is NVIDIA's first CPU. It is designed to work with the Hopper GPU, as in Grace Hopper. If you don't know who she was, see my post Grace Hopper Celebration of Women in Computing.

[Image: Grace CPU]

Jonathon Evans presented the design, which he emphasized was:

Designed from the ground up to be a superchip

You can either put two Grace chips together, put Grace together with Hopper, or create still bigger systems using NVLink. The one thing that you cannot do is use a single Grace on its own; it is always paired.

AMD Ryzen 6000 Series Processor

The final session of the conference was on mobile and edge processors. Note that in this context, "mobile" means laptops, not mobile phones, although MediaTek did manage to slip a smartphone SoC into this session.

[Image: AMD Ryzen 6000]

Jim Gibney presented AMD's Ryzen 6000 processor. It was almost a surprise that it doesn't involve chiplets: it is a single SoC with 13.1B transistors in 6nm on a 210mm² die.

Meteor Lake and Arrow Lake: Intel Next Gen 3D Client Architecture Platform with Foveros

Just when it seemed the conference would end without more chiplets, Intel's Wilfred Gomes presented Meteor Lake and Arrow Lake, which use Intel's 3D technology Foveros. A good part of the presentation was on the fundamentals of chiplets and building systems in this manner. As he put it:

Can we get monolithic performance with disaggregated architecture benefits?

Of course, the answer was "yes", and Wilfred spent some time going over the details of Intel's More and Moore packaging technologies, such as Foveros Omni and Foveros Direct. Foveros Direct allows direct die-to-die bonding of copper interconnect. The bump pitch is 25µm, allowing about 1,600 connections per mm² at under 0.15pJ per bit.
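Those Foveros Direct figures are easy to sanity-check with a little arithmetic on the quoted pitch and energy numbers:

```python
# A 25 um bump pitch on a square grid gives 1/(0.025 mm)^2 bumps per mm^2.
pitch_mm = 25e-3                      # 25 um expressed in mm
bumps_per_mm2 = 1 / pitch_mm ** 2     # = 1,600

# At <0.15 pJ/bit, pushing 1 Tb/s across the die-to-die boundary costs
# under 0.15 W (1e12 bit/s * 0.15e-12 J/bit).
link_power_w = 1e12 * 0.15e-12
```

That combination of density and energy per bit is what makes stitching tiles together look, electrically, almost like staying on one die.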

[Image: Meteor Lake]

All this allows Meteor Lake to be created as the next step on Intel's "disaggregation journey."

This has a CPU tile, a GPU tile, an SoC tile, an I/O extender tile, and underneath it all, a base tile (see the diagram above). Meteor Lake has been booted in the lab.

[Image: Meteor Lake tiles]

This approach allows scaling across multiple process generations. Meteor Lake is on Intel 4 (formerly known as 7nm). Next is Arrow Lake on Intel 20A (20 Ångströms), and Lunar Lake is so far out in the distance that it is blurry, and the process is just "Intel Next."

Dimensity 9000 – A Flagship SmartPhone SoC

Hugh Mair presented MediaTek's Dimensity 9000 chip for smartphones. Again, this is a monolithic SoC and not a disaggregated design like the previous Intel presentation.

Next-Generation Intel Processors Built for the Edge - Intel Xeon D 2700 & 1700

Intel was back for the last presentation of HOT CHIPS 2022. Praveen Mosur presented the Xeon D 2700 and 1700 "built for the edge". These were formerly called Ice Lake D. I'm not sure if it was explicitly said, but these seem to be single SoCs, not part of Intel's "disaggregation journey."

Learn More

For full details of the HOT CHIPS this year, see the program page. As I said at the start, you can still register and see the presentations for somewhere between $20 and $65, depending on whether you are a student and/or an IEEE member.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.



© 2023 Cadence Design Systems, Inc. All Rights Reserved.
