HOT CHIPS: Intel

29 Aug 2019 • 3 minute read

At HOT CHIPS in August, Intel was everywhere. The two announcements that I'm going to cover in this post are Spring Hill and Spring Crest, two deep learning accelerators: Spring Hill for inference and Spring Crest for training. But Intel also presented, for the first time ever, a lot of the internal details of Optane, its 3D XPoint phase-change memory, and of Lakefield, a tiny processor for ultra-mobile always-on devices (that's it in the middle of the board in the picture with the quarters). There are three dies in that little package: a base die for communications, a processor die, and a memory, all stacked on top of each other. See my post HOT CHIPS: Chipletifying Designs for (a little) more color on that.

Spring Hill—Intel/Nervana Datacenter Inference Chip

Spring Hill, officially the NNP-I 1000 (the I is for inference), was presented by Ofri Wechsler. It had the following design goals:

Best-in-class performance/power for major data center inference workloads
5X power scaling (10-50W) for a performance boost
A high degree of programmability without compromising performance/power efficiency, using on-die Intel Architecture (IA) cores
Data center at scale: a comprehensive set of RAS features to allow seamless deployment in existing data centers
Software stack supporting the usual frameworks

Here's how it turned out. There are two IA cores and 12 Inference Compute Engines (ICEs) on each die. The performance is 48-92 TOPS at 10-50W, giving 2.0-4.8 TOPS/W, with the best efficiency at the low-power end. The chip is built in Intel's 10nm process.
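The efficiency range is just throughput divided by power at each end of the envelope. Here is a quick sketch of that arithmetic, assuming (this pairing is my inference from the numbers, not something stated on the slide) that 48 TOPS goes with 10W and 92 TOPS with 50W:

```python
# Back-of-the-envelope check of the Spring Hill efficiency range.
# Assumption: the low-throughput figure pairs with the low-power figure
# and the high-throughput figure with the high-power figure.
operating_points = {
    "low power": {"tops": 48.0, "watts": 10.0},
    "high power": {"tops": 92.0, "watts": 50.0},
}


def tops_per_watt(tops, watts):
    """Throughput per watt for one operating point."""
    return tops / watts


efficiency = {
    name: tops_per_watt(point["tops"], point["watts"])
    for name, point in operating_points.items()
}

# Ratio between the two ends of the power envelope (the "5X" design goal).
power_scaling = (
    operating_points["high power"]["watts"] / operating_points["low power"]["watts"]
)

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} TOPS/W")
print(f"power scaling: {power_scaling:.0f}X")
```

The low-power end reproduces the 4.8 TOPS/W figure exactly. The high-power end comes out a little under the quoted 2.0 TOPS/W, so the slide presumably rounds or refers to a slightly different operating point.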

The heart of the chip is the ICE. It contains a deep learning compute grid that can do 4K INT8 MACs per cycle, with scalable support for FP16 and INT4/2/1 too. There is high-bandwidth memory access with compression and decompression to handle sparse weights (lots of zeros). There is also a programmable vector processor with high throughput and extended neural network support.
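To get a feel for what "4K MACs per cycle" means at the chip level, here is a rough peak-throughput model. It is only a sanity check under assumptions of mine, not Intel's numbers: that 4K means 4096, that one MAC counts as two operations, and that the TOPS figures count only the INT8 compute grids in the 12 ICEs. The clock values it produces are implied by the model, not quoted from the presentation.

```python
# Rough peak-throughput model for the Spring Hill compute grids.
# Assumptions (mine, not from the talk): "4K" = 4096, one MAC = 2 ops,
# and only the INT8 MAC grids contribute to the TOPS figure.
NUM_ICE = 12
MACS_PER_ICE_PER_CYCLE = 4096
OPS_PER_MAC = 2
OPS_PER_CYCLE = NUM_ICE * MACS_PER_ICE_PER_CYCLE * OPS_PER_MAC


def peak_tops(clock_ghz):
    """Peak INT8 TOPS for a given (hypothetical) clock in GHz."""
    return OPS_PER_CYCLE * clock_ghz * 1e9 / 1e12


def implied_clock_ghz(tops):
    """Clock in GHz that this model would need to reach `tops`."""
    return tops * 1e12 / OPS_PER_CYCLE / 1e9


for tops in (48, 92):
    clock = implied_clock_ghz(tops)
    print(f"{tops} TOPS -> about {clock:.2f} GHz under these assumptions")
```

Under this model the quoted 48-92 TOPS range would correspond to clocks of very roughly 0.5 to 0.9 GHz, which is at least plausible for a chip that scales from 10W to 50W.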

Here's the block diagram of the Vector Processing Engine.

During the Q&A, Ofri was asked about power. He said that they reuse lots of technology developed for other power-constrained designs. A lot of the power management and power delivery designs for laptops are useful here, too. They can dynamically shift the power budget depending on the workload. In fact, it is easier than in a laptop, where you have to work everything out on the fly, because a lot can be predicted ahead of time.

Here's how everything goes together. The vertical axis represents both flexibility and performance per watt: flexibility increases toward the top, and performance per watt increases toward the bottom. At the top of the triangle is the host Xeon processor that is being offloaded by the Spring Hill chip. It is connected using PCIe Gen 3. The lower part of the triangle is made up of the three layers inside the chip:

The IA cores, fully programmable but not good performance per watt
In the middle the vector processing unit
At the bottom, the deep learning compute grid, basically a sea of MACs

The Spring Hill chip can run ResNet50 at 3600 inferences per second at 10W, which works out to 360 inferences/second/W. There is now a more standard family of benchmarks for machine learning known as MLPerf (which was presented at HOT CHIPS and which I'll cover someday soon). Intel has submitted Spring Hill to the 0.5 edition of the benchmark.

The chip is sampling to partners and customers now. The next two generations are in planning/design.

Spring Crest—Intel/Nervana Datacenter Training Chip

Spring Crest, officially the NNP-T SoC (the T is for training), was presented by Andrew Yang. Just to show you how different Spring Crest is from Spring Hill, here is the top-level layout of the chip. The four big blue blocks at the sides are HBM2 memory stacks. There are 24 Tensor Processing Clusters (TPCs) delivering up to 119 TOPS.

Some details of the chip include:

680mm² die, 1200mm² interposer (see picture)
27 billion transistors
60mm x 60mm, 6-2-6, 3325-pin BGA package
4x8GB HBM2-2400 memory
Up to 1.1GHz core frequency
64 lanes SerDes HSIO up to 3.58Tbps aggregate BW
PCIe Gen 4 x16, SPI, I2C, GPIOs
PCIe and OAM form factors
Air-cooled, 150-250W typical workload power
All Bfloat16 multiplies with FP32 accumulate
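The memory line in that list implies both a capacity and a peak bandwidth. Here is a sketch of the arithmetic, assuming that "HBM2-2400" means 2.4Gbps per pin and that each stack uses the standard 1024-bit HBM2 interface. Neither detail is spelled out in the list above, so treat the bandwidth as an estimate.

```python
# Estimated HBM2 capacity and peak bandwidth for Spring Crest.
# From the spec list: 4 stacks of 8GB each.
# Assumptions (mine): "2400" is 2.4 Gbps per pin, and each stack has
# the standard 1024-bit-wide HBM2 interface.
NUM_STACKS = 4
GB_PER_STACK = 8
GBPS_PER_PIN = 2.4
PINS_PER_STACK = 1024

total_capacity_gb = NUM_STACKS * GB_PER_STACK

# Bits per second across the interface, divided by 8 to get bytes.
bandwidth_per_stack_gbytes = PINS_PER_STACK * GBPS_PER_PIN / 8
total_bandwidth_gbytes = NUM_STACKS * bandwidth_per_stack_gbytes

print(f"capacity: {total_capacity_gb} GB")
print(f"peak bandwidth per stack: {bandwidth_per_stack_gbytes:.1f} GB/s")
print(f"peak bandwidth total: {total_bandwidth_gbytes / 1000:.2f} TB/s")
```

That gives 32GB of memory in the package and, under these assumptions, a peak of a bit over 1.2TB/s feeding the 24 TPCs.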

Spring Crest is designed to scale out through an inter-chip crossbar. There are 16 quads of SerDes lanes running at 112Gbps per quad, giving 3.58Tbps total bi-directional bandwidth per chip. Models that are too large for one chip can be run across multiple chips. It can scale all the way up to 1024 nodes, with a built-in programmable router. That's obviously more than you can fit on a single board, so it scales up at the rack level, too.
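The 3.58Tbps figure can be reconstructed from the quad count and the 64 SerDes lanes in the spec list. This is a sketch of that arithmetic, assuming that "bi-directional" means the one-direction total counted twice (my reading, which is what makes the numbers match):

```python
# Reconstructing the Spring Crest inter-chip bandwidth figure.
# From the post: 16 quads at 112 Gbps each, 64 SerDes lanes in total.
# Assumption (mine): "bi-directional" doubles the one-direction total.
NUM_QUADS = 16
GBPS_PER_QUAD = 112
NUM_LANES = 64

lanes_per_quad = NUM_LANES // NUM_QUADS
gbps_per_lane = GBPS_PER_QUAD / lanes_per_quad

one_direction_gbps = NUM_QUADS * GBPS_PER_QUAD
bidirectional_tbps = 2 * one_direction_gbps / 1000

print(f"lanes per quad: {lanes_per_quad}")
print(f"per-lane rate: {gbps_per_lane:.0f} Gbps")
print(f"one direction: {one_direction_gbps} Gbps")
print(f"bi-directional: {bidirectional_tbps:.3f} Tbps")
```

So each quad is four lanes at 28Gbps, and the aggregate is 1.79Tbps in each direction.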

More Details

You can already watch the two presentations on the Intel website on their Nervana page. All the images in this post came from the Intel presentations.
