Paul McLellan

Intel at Linley

9 May 2019 • 4 minute read

At the recent Linley Spring Microprocessor Conference, there were two presentations by Intel about deep learning. The first was by Ian Steiner, the lead architect for Cascade Lake. The second was by Carey Kloss, the VP of Hardware for the AI products group. He came from Nervana, which Intel acquired in 2016.

Cascade Lake

Let's start with a block diagram. VNNI are the new Vector Neural Network Instructions. As I've said before, I don't think anyone really predicted until a year or two ago that you could do 8-bit by 8-bit multiplies with a 32-bit result and accumulate and give up very little accuracy compared to 32-bit floating point. Without VNNI, using what Ian called legacy INT8 on the underlying highly parallel processor architecture, 128 MACs of this type could be done in three cycles, using two ports per core. With VNNI, all of that is done in a single cycle, a 3X speedup.
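To make the instruction count concrete, here is a minimal sketch (not Intel's code) of one accumulation step of an INT8 dot product, assuming a compiler and CPU with the AVX-512BW and AVX-512VNNI extensions: the legacy INT8 path needs the three-instruction vpmaddubsw/vpmaddwd/vpaddd sequence, while VNNI collapses it into a single vpdpbusd.

```c
// Illustrative sketch only; compile with -mavx512f -mavx512bw -mavx512vnni.
#include <immintrin.h>

// Legacy INT8: three dependent instructions per 64 bytes of inputs.
__m512i dot_accumulate_legacy(__m512i acc, __m512i a_u8, __m512i b_s8) {
    __m512i ones = _mm512_set1_epi16(1);
    __m512i t16  = _mm512_maddubs_epi16(a_u8, b_s8); // u8*s8 products, pairs summed to s16
    __m512i t32  = _mm512_madd_epi16(t16, ones);     // s16 pairs summed to s32
    return _mm512_add_epi32(acc, t32);               // accumulate in 32 bits
}

// VNNI: the same work fused into a single vpdpbusd instruction.
__m512i dot_accumulate_vnni(__m512i acc, __m512i a_u8, __m512i b_s8) {
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);     // multiply-accumulate into 32-bit lanes
}
```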

Due to interactions between the underlying out-of-order architecture, memory interfaces, and other contention, the improvement you get on real workloads can be very different from just looking at a single instruction. In the case of neural networks, there are a number of more-or-less standard benchmarks, such as ResNet-X, and a number of neural network frameworks, such as TensorFlow and Caffe. Ian did have a caveat: while image recognition is nice for demos, what Intel's customers mostly make money with is recommendation systems.

The comparison in the above chart shows the performance gain from using VNNI compared with 32-bit floating point, with the accuracy loss in the small print. Floating point is already very efficient on Xeon processors, and without VNNI, Ian said, the gains from going to 8-bit were there but small. With VNNI, on full models, the gains range from 2X to 4X. There is also, unsurprisingly, a significant improvement in performance per watt compared to FP32.

Another worry when you speed things up a lot is whether the memory cache hierarchy can keep up, or whether more cache needs to be added. The above chart shows that the cache hierarchy can "feed the beast". The chart shows the memory bandwidth sampled every 50 microseconds. At the top, in blue, is floating point. At the bottom, in purple, is VNNI. The bandwidth picture actually improves a lot, with the number of samples requiring more than 80GB/s dropping from 25% to 10%, and the number needing less than 15GB/s almost doubling from 35% to 65%. Verdict: the cache can keep up.

Deep Learning by Design

Carey started with the three principles of the Nervana Neural Network Processor (NNP) design philosophy:

  • Fit as many multipliers on the die as possible.
  • Maximize on-die data re-use.
  • Ensure scaling for the largest problems.

The architecture of the NNP-L is shown in the diagram to the right. There are four HBM2 memories. There are also interchip links for scaling to big multi-chip systems, and a PCIe interface for communicating with the host processor. The CCs are compute clusters.

The architecture inside is designed to minimize access to the HBMs. There are large local SRAMs in the CCs. Further, the data buses between the CCs are separate from the data buses to HBM, so the CCs can receive data from both neighbors and HBM at the same time. Once a value is read from HBM, it can then be shared among all the CCs without needing to go back to HBM to fetch the value for each CC. Less data movement also means lower power. Or as Carey calls it:

Read once, use many.
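As a purely illustrative software model of that idea (the names fetch_tile_from_hbm, local_compute, NUM_CLUSTERS, and TILE_SIZE are hypothetical, not part of the NNP), the pattern looks like this: each tile is read from HBM once, then consumed by every compute cluster from fast local storage.

```c
// Sketch of "read once, use many" data reuse; models the idea, not the hardware.
#include <stddef.h>

#define NUM_CLUSTERS 4
#define TILE_SIZE    1024

typedef struct { float data[TILE_SIZE]; } tile_t;

extern void fetch_tile_from_hbm(size_t index, tile_t *dst);    // expensive off-die read
extern void local_compute(int cluster, const tile_t *tile);    // cheap on-die work

void process_tiles(size_t num_tiles) {
    tile_t tile;
    for (size_t i = 0; i < num_tiles; i++) {
        // One HBM read per tile...
        fetch_tile_from_hbm(i, &tile);
        // ...then every cluster consumes the same copy, instead of re-reading HBM.
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            local_compute(c, &tile);
        }
    }
}
```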

Scalability is all about running the world's biggest models, and enabling data and model parallelism. There is lots of scope for this in a neural network: to first approximation, if you have 1T operations to perform, the more compute power you throw at it, the faster it will go. But there are a lot of details, from minimizing memory accesses (already discussed) to how the interconnect is done.

As you can see from the photo, there is scaling within a chassis, and from chassis to chassis.

He had graphs showing NNP-L scaling from 1 to 64 chips, with efficiency (on ResNet150) degrading only from 100% to 91%, very close to perfect.

The chip also supports Bfloat16 (with FP32 accumulation), which is very attractive for both ease of convergence (good) and power (low). It delivers "32-bit accuracy with 16-bit training speed".
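For anyone unfamiliar with the format, here is a minimal C sketch of bfloat16 and of a dot product with FP32 accumulation. It is a software illustration that converts by simple truncation; it is not how the NNP implements it, and real hardware typically rounds to nearest even.

```c
// bfloat16 = the top 16 bits of an IEEE FP32 value (sign, 8 exponent bits,
// 7 mantissa bits), so its range matches FP32 while storage is halved.
#include <stdint.h>
#include <string.h>

typedef uint16_t bf16_t;

static bf16_t fp32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));
    return (bf16_t)(bits >> 16);          // keep sign + exponent + top 7 mantissa bits
}

static float bf16_to_fp32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;    // low mantissa bits become zero
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

// Dot product with bfloat16 inputs and a 32-bit accumulator, the combination
// described in the talk as "32-bit accuracy with 16-bit training speed".
static float bf16_dot(const bf16_t *a, const bf16_t *b, int n) {
    float acc = 0.0f;                     // FP32 accumulation preserves accuracy
    for (int i = 0; i < n; i++)
        acc += bf16_to_fp32(a[i]) * bf16_to_fp32(b[i]);
    return acc;
}
```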

Finally, Carey added a new, fourth, principle: software. None of this matters if people can't program it. So the NNP supports all the major frameworks, and with the nGraph deep learning compiler it can target any of the Intel processors, from Atom to Xeon, from Nervana to Movidius.

The chips will sample to customers later in the year.

Carey wrapped up with a nice story about AI failure. His wife is called Milan (mee-lan). "Hey Siri, call Milan," he said.

From then on, Siri called me Lan. But she never called my wife.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.