# HOT CHIPS Tutorial: On-Device Inference

The Sunday of the annual HOT CHIPS conference (the 30th!) is tutorial day. The morning tutorial was on blockchain, which I missed due to other commitments. The afternoon tutorial was on deep learning, and it was divided into three parts:

- Overview of Deep Learning and Computer Architectures for Accelerating DNNs.
- Accelerating Inference at the Edge
- Accelerating Training in the Cloud

I am going to focus on the on-device inference section since that is most relevant to applications for the higher-end entries in Cadence's Tensilica processor portfolio. I'll also pull in some information on benchmarks from the cloud-based segment. The on-device section was presented by Song Han of MIT's Han's Lab. The lab's name is not just his surname: the H stands for High-performance, high-energy-efficient Hardware, the A is for Architectures and Accelerators for Artificial intelligence, the N is for Novel algorithms for Neural Networks, and the S is for Small Models (for inference), Scalable Systems (for training), and Specialized Silicon.

Obviously, one way you can do a better job of on-device inference is to build a special on-device inference engine. Indeed, during the two main days of HOT CHIPS several were presented from Arm, NVIDIA, Xilinx, and DeePhi...except that a few weeks ago Xilinx acquired DeePhi, so that one was Xilinx too. But there's more. All the server processor presentations had optimizations for neural network programming. Even the next generation Intel processor, which is called Cascade Lake SP for now, has new extensions to the ISA adding a couple of instructions specifically for evaluating neural nets faster than is possible with the regular instructions. But that is a topic for another day (or two).

Training a neural network almost always takes place in the cloud, using 32-bit floating point. There is a lot of research showing that you need to keep the precision during training, even if you eventually plan to run a reduced model. If you reduce too soon, you risk getting stuck in local minima, or ending up with training that does not converge. Usually, when you see a graph showing a surface representing the space that the training algorithm is exploring, it is a nice smooth saddle where nothing can go wrong. But the picture below is actually more representative:

## Deep Model Compression

I first saw Song Han speak at Cadence. See my post The Second Neural Network Symposium for more details. Back then Song was still doing his PhD on compressing neural networks. Somewhat to everyone's surprise, it turns out that you can compress neural networks a lot more than anyone expected.

### Pruning

The first optimization is pruning. The network as it comes out of training has a lot of connections, and many of them can be removed with little or no loss of accuracy. Once they are removed, the network can be retrained with the reduced connectivity, and any lost accuracy is regained by recalculating the remaining weights. The process of pruning and retraining can be iterated until no further reduction is possible without too much loss of accuracy.
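The prune-and-retrain loop can be sketched in a few lines of plain Python. This is a minimal illustration, not Song Han's implementation: the talk summary does not say how connections are chosen for removal, so ranking by absolute weight is an assumption here, and the names `magnitude_prune` and `masked_update` are hypothetical.

```python
def magnitude_prune(weights, fraction):
    """Zero the `fraction` of weights with the smallest absolute value.

    Returns (pruned_weights, mask); mask is 1 for kept connections, 0 for removed.
    Ranking by magnitude is an assumed criterion, chosen for illustration.
    """
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    n_prune = int(len(ranked) * fraction)
    mask = [[1] * len(row) for row in weights]
    for _, r, c in ranked[:n_prune]:
        mask[r][c] = 0
    pruned = [
        [w if m else 0.0 for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]
    return pruned, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One retraining step that keeps removed connections at exactly zero."""
    return [
        [(w - lr * g) if m else 0.0 for w, g, m in zip(w_row, g_row, m_row)]
        for w_row, g_row, m_row in zip(weights, grads, mask)
    ]


weights = [[0.9, -0.01, 0.3], [0.02, -0.7, 0.05]]
pruned, mask = magnitude_prune(weights, fraction=0.5)
print(pruned)  # [[0.9, 0.0, 0.3], [0.0, -0.7, 0.0]]
```

In a real flow, the gradients passed to `masked_update` would come from the training framework, and the prune and retrain steps would alternate until accuracy starts to suffer.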

It turns out that the human brain does pruning too. A newborn has 50 trillion synapses, and this grows with the brain to 1,000 trillion synapses by the time the baby is one year old. But that gets halved back down to 500 trillion synapses by the time that baby is an adolescent. Pruning the neural network this way has a similar effect, and sometimes the pruned and retrained network is not just smaller than the original but more accurate too. Using this approach on AlexNet, the convolutional layers can be reduced by 3X, and the fully connected layers by 10X.

### Sparsity

The next technique is sparsity. There is an obvious, straightforward optimization when a zero weight is fed into a multiplier, since we know that anything times zero is zero. So not only do we not need to feed the zero into the multiplier, we can hold the other input at its old value and save both a memory access and the power from toggling the bus. After training, a lot of weights barely participate in the inference and are close to zero. By setting them exactly to zero, the matrix becomes sparse and all sorts of optimizations are possible.
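One of those optimizations is simply not storing the zeros. The sketch below uses compressed sparse row (CSR) storage, a standard sparse-matrix layout that I am picking for illustration; the talk summary does not name a specific format, and `to_csr` and `csr_matvec` are hypothetical helper names.

```python
def to_csr(matrix):
    """Keep only the non-zero weights, plus enough indexing to find them again."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for c, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(c)
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that only touches the stored non-zero weights."""
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out


matrix = [[0, 2, 0, 0], [1, 0, 0, 3], [0, 0, 0, 0]]
values, col_idx, row_ptr = to_csr(matrix)
print(values, col_idx, row_ptr)  # [2, 1, 3] [1, 0, 3] [0, 1, 3, 3]
print(csr_matvec(values, col_idx, row_ptr, [1, 2, 3, 4]))  # [4.0, 13.0, 0.0]
```

Here 12 matrix entries shrink to 3 stored values, at the cost of the index arrays, which is why the savings only pay off once the matrix is sparse enough.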

The sparsity can be unstructured, or it can be structured, as in the diagram below. By using sparsity, hardware that looks like it can deliver, say, 1 TOPS can effectively deliver 3 TOPS (if you count all the operations involving zero that were never actually executed).
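The "effective TOPS" arithmetic can be made concrete with a zero-skipping dot product that counts how many multiply-accumulates it really performs. This is a software sketch of the idea only; the weights and activations are made-up values, and `sparse_dot` is a hypothetical name.

```python
def sparse_dot(weights, activations):
    """Dot product that skips zero weights. Returns (result, macs_executed)."""
    total = 0.0
    executed = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # no multiply, and the activation is never needed
        total += w * a
        executed += 1
    return total, executed


# Made-up example: two thirds of the weights are zero.
weights = [0.5, 0.0, 0.0, -1.0, 0.0, 0.0]
activations = [2.0, 3.0, 4.0, 0.5, 5.0, 6.0]

result, executed = sparse_dot(weights, activations)
nominal = len(weights)                  # MACs a dense engine would count
effective_ratio = nominal / executed    # 6 / 2 = 3.0
print(result, executed, effective_ratio)  # 0.5 2 3.0
```

With that ratio, an engine that executes 1 TOPS of real work gets credited with 3 TOPS of nominal work, which is the counting convention described above.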

### Quantization

Quantization in this context means reducing the width of the weights from 32-bit floating point to 16-bit, 8-bit, or even lower. It seems surprising that you would not lose a lot of accuracy by doing this, but deep compression really works. Song actually did a lot of the research in this area as part of his doctoral thesis, and found "you could be significantly more aggressive than anyone thought possible."
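As a rough illustration of what reducing the width means, here is a sketch of symmetric linear quantization to 8-bit integers. This is one common scheme that I am assuming for the example, not necessarily the one used in Song's work, and `quantize_int8` and `dequantize` are hypothetical names.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored weight is within half a quantization step of the original.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, worst_error <= scale / 2 + 1e-12)
```

Each weight now needs 8 bits instead of 32, plus one shared scale per group of weights, and the rounding error per weight is bounded by half a step.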

### Putting It All Together

If you do all of this, you get compression ratios as high as 50X. If that number is surprising, then more surprising still is that every one of the benchmarks that Song talked about had increased accuracy with the compressed networks. Compression is not a compromise; there is clearly a reason Mother Nature prunes our brains too.

But wait, there's more...the pruned models accelerate image classification and object detection. This is because the limit on speed (the so-called "roofline") comes from hitting the memory bandwidth limit, not the computational limit. By reducing memory accesses, the computation units can be kept busier. This is almost independent of what engine is being used to perform the inference. There really does seem to be only upside to compressing the network: smaller, faster, more accurate.
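The roofline argument boils down to one formula: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the operations performed per byte fetched. The sketch below uses entirely hypothetical hardware numbers (they are not from the talk) to show why moving fewer bytes helps a memory-bound engine.

```python
def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline model: the tighter of the compute roof and the memory roof."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


# Hypothetical engine: 1 TOPS peak, 25 GB/s of memory bandwidth.
PEAK = 1e12
BANDWIDTH = 25e9

# Hypothetical workload: 10 ops per byte fetched before compression.
dense = attainable_ops_per_s(PEAK, BANDWIDTH, ops_per_byte=10)

# If compression cuts the bytes moved by 3X for the same work,
# the ops per byte fetched triple.
compressed = attainable_ops_per_s(PEAK, BANDWIDTH, ops_per_byte=30)

print(dense / 1e9, compressed / 1e9)  # 250.0 750.0 (giga-ops per second)
```

In both cases the engine sits under the memory roof rather than the 1 TOPS compute roof, so the speedup comes purely from the reduced traffic.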

## Designing Hardware

Based on the presentations of specialized neural network processors over the following couple of days, I would say that the lesson that everyone has taken away from the work of Song Han (and others) is:

- Train in the cloud at full precision
- Compress the network using the techniques above
- Optimize the inference hardware for sparse matrices, avoiding representing zeros.
- Optimize for MAC operations where one input is zero, suppressing both the operation and the access to the non-zero operand.
- Reduce the precision to 8 bits (or maybe 16 bits) and build lots of 8-bit MACs.
- Don't use caches; they just waste area. Instead, be smart about ordering the operations so that values fetched from memory are reused as much as possible, rather than moving on and coming back later to reload the same value (of course, you can't avoid this completely, but you can be smart, or rather your compiler can be).
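The last point, ordering operations for reuse, can be illustrated with a toy model. The sketch below assumes a hypothetical engine with a single weight register and no cache, and counts how many weight fetches two loop orders need for the same batch of matrix-vector products. It is not any particular accelerator's dataflow, and all names are made up for the example.

```python
class WeightMemory:
    """Toy memory: one holding register, every change of address is a fetch."""

    def __init__(self, weights):
        self.weights = weights
        self.held = None
        self.fetches = 0

    def load(self, i, j):
        if (i, j) != self.held:
            self.fetches += 1
            self.held = (i, j)
        return self.weights[i][j]


def input_major(weights, inputs):
    """Finish one input vector at a time: every weight is refetched per input."""
    mem = WeightMemory(weights)
    out = [[0.0] * len(weights) for _ in inputs]
    for b, x in enumerate(inputs):
        for i, row in enumerate(weights):
            for j in range(len(row)):
                out[b][i] += mem.load(i, j) * x[j]
    return out, mem.fetches


def weight_major(weights, inputs):
    """Fetch each weight once and apply it to every input before moving on."""
    mem = WeightMemory(weights)
    out = [[0.0] * len(weights) for _ in inputs]
    for i, row in enumerate(weights):
        for j in range(len(row)):
            for b, x in enumerate(inputs):
                out[b][i] += mem.load(i, j) * x[j]
    return out, mem.fetches


weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
inputs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
out_a, fetches_a = input_major(weights, inputs)
out_b, fetches_b = weight_major(weights, inputs)
print(out_a == out_b, fetches_a, fetches_b)  # True 24 6
```

Both orderings compute the same outputs, but the second touches memory a quarter as often for a batch of four, which is the kind of scheduling decision a compiler for such hardware would be expected to make.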
