At HOT CHIPS in August, Intel was everywhere. The two announcements that I'm going to cover in this post were Spring Hill and Spring Crest, two deep learning accelerators: Spring Hill for inference and Spring Crest for training. But they also presented, for the first time ever, a lot of the internal details of Optane, their 3DXpoint phase-change memory, and also Lakefield, which is a tiny core for ultra-mobile always-on devices (that's it in the middle of the board in the picture with the quarters). There are three die in that little package in the middle: a base die for communications, a processor die, and a memory die, all stacked on top of each other. See my post HOT CHIPS: Chipletifying Designs for (a little) more color on that.
Spring Hill, which officially is the NNP-I 1000 (the I is for inference), was presented by Ofri Wechsler. These were the design goals:
Here's how it turned out. There are two IA cores and 12 Inference Compute Engines (ICEs) on each die. The performance is 48-92 TOPS at 10-50W, giving 2.0-4.8 TOPS/W. The chip is built in Intel's 10nm process.
The heart of the chip is the ICE. It contains a deep learning compute grid that can do 4K int8 MACs per cycle, with scalable support for FP16 and INT4/2/1 as well. There is high-bandwidth memory access with compression and decompression to handle sparse weights (lots of zeros). There is a programmable vector processor with high throughput and extended neural network support.
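To see why sparse weights matter, here is a minimal sketch of a zero-skipping int8 multiply-accumulate. This is purely illustrative (the actual ICE microarchitecture is not public): the idea is that zero weights, once compressed away, cost neither memory bandwidth nor MAC cycles.

```python
def sparse_int8_mac(weights, activations):
    """Multiply-accumulate over int8 operands, skipping zero weights.

    A real accelerator would accumulate in a wider register (e.g. int32)
    to avoid overflow; plain Python ints don't overflow, so this sketch
    just models the zero-skipping behavior.
    """
    acc = 0
    for w, a in zip(weights, activations):
        if w != 0:  # zero weights are compressed away: no fetch, no MAC
            acc += w * a
    return acc

# 5 of the 8 weights are zero, so only 3 MACs are actually performed.
weights = [0, 3, 0, -2, 0, 0, 1, 0]
activations = [5, 4, 7, 6, 1, 2, -3, 9]
print(sparse_int8_mac(weights, activations))  # 12 - 12 - 3 = -3
```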
Here's a look inside the Vector Processing Engine block diagram.
During the Q&A, Ofri was asked about power. He said that they reuse lots of technology developed for other power-constrained designs. A lot of the power management and power delivery designs for laptops are useful here, too. They can dynamically shift the power budget depending on the workload. In fact, it is easier than in a laptop, where you have to work everything out on the fly, because a lot can be predicted ahead of time.
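The idea of shifting the power budget by predicted workload can be sketched very simply. This is a hypothetical illustration, not Intel's actual policy: a fixed chip-level budget is divided among engines in proportion to how much work each is predicted to do, so an idle engine's share flows to the busy ones.

```python
def allocate_power(total_watts, predicted_loads):
    """Split a fixed power budget across engines in proportion to
    predicted workload (hypothetical sketch; the real NNP-I
    power-management policy is not public)."""
    total_load = sum(predicted_loads)
    if total_load == 0:
        return [0.0 for _ in predicted_loads]
    return [total_watts * load / total_load for load in predicted_loads]

# Three engines, one idle: the idle engine's share goes to the busy ones.
print(allocate_power(50, [2.0, 3.0, 0.0]))  # [20.0, 30.0, 0.0]
```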
Here's how everything goes together. The vertical axis represents both flexibility and performance per watt. At the top of the triangle is the host Xeon processor that is being offloaded by the Spring Hill chip, connected using PCIe Gen 3. The lower part of the triangle consists of the three layers inside the chip:
The Spring Hill chip can run ResNet50 at 3600 inferences per second at 10W, giving 360 images/second/W. There is now a more standard family of benchmarks for machine learning known as MLPerf (which was presented at HOT CHIPS and I'll cover someday soon). Intel has submitted Spring Hill to the 0.5 edition of the benchmark.
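The efficiency figure follows directly from the two numbers quoted: throughput divided by power.

```python
# Spring Hill's quoted ResNet50 numbers, as reported in the presentation.
inferences_per_second = 3600
power_watts = 10

efficiency = inferences_per_second / power_watts
print(efficiency)  # 360.0 images/second/W
```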
The chip is sampling to partners and customers now. The next two generations are in planning/design.
Spring Crest, officially the NNP-T SoC (the T is for training), was presented by Andrew Yang. Just to show you how different Spring Crest is from Spring Hill, here is the top-level layout of the chip. The four big blue blocks on either side are HBM2 memory stacks. There are 24 Tensor Processing Clusters (TPCs) delivering up to 119 TOPS.
Some details of the chip include:
Spring Crest is designed to scale out through an inter-chip crossbar. There are 16 quads of 112Gbps giving 3.58Tbps total bi-directional bandwidth per chip. The largest models can be run on multiple chips. It can scale all the way up to 1024 nodes, with a built-in programmable router. That's obviously more than you can fit on a single board and so it scales up at the rack level, too.
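One way to reconcile the quoted link numbers (my reading of the slide, not an official breakdown): 16 quads at 112Gbps each, counted in both directions, comes to roughly the quoted 3.58Tbps per chip.

```python
# Assumed interpretation of the Spring Crest scale-out numbers.
quads = 16
gbps_per_quad = 112
directions = 2  # bi-directional: transmit plus receive

total_tbps = quads * gbps_per_quad * directions / 1000
print(total_tbps)  # 3.584, i.e. the quoted ~3.58Tbps
```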
You can already watch the two presentations on the Intel website on their Nervana page. All the images in this post came from the Intel presentations.
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.