Paul McLellan
18 Sep 2018

HOT CHIPS: Some HOT Deep Learning Processors

If there was a theme running through the recent HOT CHIPS conference in Cupertino, it was deep learning. There were two sessions on machine learning, but every processor described in the server processor session also had something to handle deep learning training. I'm not going to attempt to write about all of them, but because of their ubiquity, I'll discuss the presentations by Arm and NVIDIA on their deep learning processors.

On the subject of deep learning, I covered the Sunday tutorial in HOT CHIPS Tutorial: On-Device Inference. The Arm and NVIDIA chips are focused on this area, and also take into account a lot of the specific compression techniques discussed in the tutorial.

Arm

Ian Bratt presented Arm's First-Generation Machine Learning Processor. This is a brand new processor optimized for machine learning. Like any specialized processor in this area, it delivers a big efficiency uplift over CPUs, GPUs, and DSPs. For now, at least, it seems to be called simply the Arm ML processor, although I expect it will get an Arm-ish name when it is officially released later this year (TechCon is in October, so if I were a betting man I'd go for Mike Muller's keynote).

[Arm ML processor block diagram]

Ian gave the four key ingredients for a machine learning processor as: static scheduling, efficient convolutions, bandwidth reduction mechanisms, and programmability.

The static scheduling is implemented by a mixture of compilation, which analyzes the NN (neural network) and produces a command stream, and a control unit that executes the command stream. There are no caches; memory and DMA are managed directly by the compiler and processor.
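
To make the static-scheduling idea concrete, here is a minimal sketch in Python. It is purely illustrative, not Arm's actual toolchain or command format: a compiler pass walks the network once, offline, emits a fixed command stream with all DMA planned up front, and a simple control unit just replays it.

    # Illustrative only -- not Arm's compiler or command format. The point is
    # that all scheduling and DMA decisions are made offline, so the control
    # unit replays a precomputed command stream with no caches involved.
    layers = [("conv1", 3, 32), ("conv2", 32, 64), ("fc", 64, 10)]  # (name, in_ch, out_ch)

    def compile_to_commands(layers):
        commands = []
        for name, in_ch, out_ch in layers:
            commands.append(("DMA_LOAD_WEIGHTS", name))          # DMA planned by the compiler
            commands.append(("RUN_LAYER", name, in_ch, out_ch))   # issued to the compute engines
            commands.append(("STORE_ACTIVATIONS", name))          # working set kept in SRAM
        return commands

    def control_unit(commands):
        for cmd in commands:   # fixed order, no runtime scheduling decisions
            print("execute", cmd)

    control_unit(compile_to_commands(layers))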

Convolutions are done efficiently, mapping different parts of the input and output feature maps among the 16 compute engines in the system. The MAC engine (one per compute engine) is capable of eight 16x16 8-bit dot products per cycle, so with 16 MAC engines you get 4096 ops/cycle, making 4.1 TOPS at a 1GHz clock. There is full datapath gating for zeros, giving a 50% power reduction; see the tutorial post linked above for much more about handling zeros. There are also mechanisms for passing activations from one compute engine to another, broadcast on the network that links them all. The processor has a POP optimization kit for the MAC engines, tuned for 16nm and 7nm, which provides an impressive 40% area reduction and 10-20% power reduction versus just using the normal cells.
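
The headline throughput number is easy to check. Reading "eight 16x16 8-bit dot products" as eight 16-element dot products per MAC engine per cycle, and counting each multiply-accumulate as two operations, a quick back-of-the-envelope calculation reproduces it:

    # Sanity check of the quoted throughput (1 MAC counted as 2 ops;
    # "eight 16x16 dot products" read as eight 16-element dot products).
    engines = 16        # compute engines
    dot_products = 8    # dot products per MAC engine per cycle
    elements = 16       # 8-bit elements per dot product
    ops_per_cycle = engines * dot_products * elements * 2   # multiply + add
    print(ops_per_cycle)                   # 4096 ops/cycle
    print(ops_per_cycle * 1e9 / 1e12)      # ~4.1 TOPS at a 1GHz clock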

DRAM power can be nearly as high as the power of the processor itself (in the pie chart, yellow is the ML processor power and the rest is memory: blue for the weights, black for the activations), so compression to reduce this is important. The ML processor supports weight compression, activation compression, and tiling. This results in a saving of about 3X with no loss in accuracy (since the compression is lossless).
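
The talk did not describe the compression scheme itself, but the reason lossless compression pays off here is that pruned, clustered weights and activations are full of zeros. A toy zero run-length coder illustrates the effect (purely an illustration, not the scheme Arm uses):

    # Toy illustration only -- not Arm's compression scheme. Sparse tensors
    # compress well even with a trivial lossless coder such as zero
    # run-length encoding.
    weights = [0, 0, 0, 3, 0, 0, -2, 0, 0, 0, 0, 5, 0, 0, 0, 0]

    def zero_rle(values):
        encoded, zeros = [], 0
        for v in values:
            if v == 0:
                zeros += 1
            else:
                encoded.append((zeros, v))   # (zeros skipped, non-zero value)
                zeros = 0
        if zeros:
            encoded.append((zeros, None))    # trailing run of zeros
        return encoded

    print(zero_rle(weights))   # [(3, 3), (2, -2), (4, 5), (4, None)]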

As discussed in the tutorial linked above, pruning during the training phase increases the number of zeros, and clustering can snap the remaining non-zero weights to a small collection of possible non-zero values (easy to compress). The models are compressed offline during compilation. The weights, which dominate the later layers of networks, remain compressed until read out of internal SRAM. Compiler-based scheduling is tuned to keep the working set in SRAM, and tiled or wide scheduling minimizes trips to DRAM. Multiple outputs can be calculated in parallel from the same input. This is all possible due to the static scheduling, which is set up at compile time and executed in the processor.
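
As a rough sketch of what that offline pruning and clustering step does (the threshold and cluster values below are invented for the example), pruning forces small weights to exactly zero, and clustering snaps the survivors to a small palette of values that compresses well:

    # Illustrative sketch of compile-time pruning and weight clustering;
    # the threshold and palette are made up for this example.
    weights = [0.02, -0.41, 0.38, -0.01, 0.77, 0.003, -0.79, 0.40]
    palette = [-0.8, -0.4, 0.4, 0.8]         # cluster centers chosen during training

    def prune_and_cluster(w, threshold=0.05):
        if abs(w) < threshold:
            return 0.0                        # pruned: becomes a compressible zero
        return min(palette, key=lambda c: abs(c - w))   # snap to nearest center

    print([prune_and_cluster(w) for w in weights])
    # [0.0, -0.4, 0.4, 0.0, 0.8, 0.0, -0.8, 0.4]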

At the bottom of the block diagram above is the programmable layer engine. This is largely there to future-proof the processor, since the state of the art in neural networks is evolving on an almost daily basis. Ian was deliberately vague about exactly what this processor is, but it "extends ARM CPU technology with vector and NN extensions targeted for non-convolutional operators". It operates on the results of the MAC computations, and most of that work is done by a 16-lane vector engine.
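
Since the layer engine was not described in detail, the following is only a guess at the flavor of what it does: a non-convolutional operator (here a ReLU activation) applied to MAC results, a vector of 16 lanes at a time.

    # Illustrative only -- the programmable layer engine is not publicly
    # documented. This just shows a non-convolutional operator (ReLU)
    # applied to MAC results 16 lanes at a time.
    LANES = 16

    def relu_16_lanes(mac_results):
        out = []
        for i in range(0, len(mac_results), LANES):
            group = mac_results[i:i + LANES]          # one vector issue
            out.extend(x if x > 0 else 0 for x in group)
        return out

    print(relu_16_lanes([-3, 1, 4, -1, 5, -9, 2, 6] * 4))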

The basic design is very scalable: in the number of compute engines (16 in this implementation), in MAC engine throughput (add more MACs), and in the overall number of ML processors.

The summary of the new Arm ML processor is:

  • 16 compute engines
  • ~4 TOPS of convolution throughput (at 1 GHz)
  • Targeting >3 TOPS/W in 7nm at ~2.5mm²
  • 8-bit quantized integer support
  • 1MB of SRAM
  • Support for Android NNAPI and ARMNN
  • To be released in 2018

NVIDIA

Frans Sijstermans of NVIDIA presented the NVIDIA Deep Learning Accelerator, NVDLA. It was originally developed as part of Xavier, NVIDIA's SoC for autonomous driving. It is optimized for convolutional neural networks (CNNs) and computer vision. NVIDIA decided to open source the architecture and the RTL. You can simply download it and use it without needing any special permission from NVIDIA. They have taken the view that they cannot cover all applications of deep learning. As Frans put it:

The more people who do deep learning, the better it is for us.

Obviously, for now anyway, the more people do inference at the edge, the more training needs to be done in the cloud, and that means more NVIDIA GPUs will be needed to do it. Of course, there are the usual advantages of open source in contributions from others in the community, and there is nothing but upside for NVIDIA if NVDLA becomes a de facto standard.

The high-level architecture is shown in the above block diagram. The processor is scalable. Frans talked about two particular configurations. "Small" has an 8-bit datapath, one RAM interface, and none of the advanced features. The "large" configuration has 8-bit and 16-bit integer and 16-bit floating-point datapaths, two RAM interfaces, an integrated controller, weight compression, and more. Below is some performance data for the large configuration.
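
For clarity, here is the small-versus-large comparison from the talk written out as a plain Python dictionary. This is not NVDLA's actual configuration format (the real project configures the RTL from hardware spec files available via nvdla.org); it just collects the differences Frans listed in one place.

    # Summary of the two configurations described in the talk -- NOT the
    # project's real spec-file syntax, just the differences in one place.
    nvdla_configs = {
        "small": {
            "datapaths": ["int8"],
            "ram_interfaces": 1,
            "weight_compression": False,
            "integrated_controller": False,
        },
        "large": {
            "datapaths": ["int8", "int16", "fp16"],
            "ram_interfaces": 2,
            "weight_compression": True,
            "integrated_controller": True,
        },
    }

    for name, cfg in nvdla_configs.items():
        print(name, cfg)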

The processor is available at nvdla.org. Their summary paragraph there says:

The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators. With its modular architecture, NVDLA is scalable, highly configurable, and designed to simplify integration and portability. The hardware supports a wide range of IoT devices. Delivered as an open source project under the NVIDIA Open NVDLA License, all of the software, hardware, and documentation will be available on GitHub. Contributions are welcome.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.

Tags:
  • Intel
  • deep learning
  • processor
  • NVIDIA
  • machine learning
  • hot chips
  • ARM