
Paul McLellan

Scaling Embedded Inference Performance for Deep Learning

3 Oct 2017 • 6 minute read

Today is the Linley Processor Conference, and Pulin Desai of Cadence is one of the presenters this morning. He will be talking about Scaling Embedded Inference Performance for Deep Learning.

Deep Learning

Of course, Deep Learning (DL) is one of the hottest topics around right now. The reason is simple. There are lots of tasks for which the algorithmic approach is impractical, since it is very difficult to work out what the algorithm should be. One of the classic problems is telling whether images are pictures of cats or not. Even if you can't program, you can see the problem if you think about how you would explain to a kid who had never seen a cat how to go about recognizing one. That's before you worry about the hard cases: cats in the fog, cats at night, cats seen from odd angles. However, it turns out that convolutional neural nets (CNNs) are really good at solving this problem, provided you have lots of training data. In this case, training data means lots of pictures of cats, and lots of pictures of non-cat things like dogs and bagels and aircraft. The catch is that the data needs to be labeled, in this over-simple example just as cat and not-a-cat.
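To make that concrete, here is a minimal sketch of what training such a classifier looks like in a modern framework (TensorFlow/Keras here; the network size and the training set are illustrative, not anything from Pulin's talk):

```python
import tensorflow as tf

# Tiny CNN for a binary cat / not-a-cat decision. Labels: 1 = cat, 0 = not.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),        # small RGB images
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs P(cat)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# train_images/train_labels would be the labeled examples described above:
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```

The point is that nobody writes down cat-recognition rules anywhere; the network learns them from the labeled examples.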

This approach works really well for many things that are more important than recognizing cats, which, after all, any toddler is pretty good at. Cadence's network for recognizing street signs using the German traffic sign benchmark does better than humans, for example. There is more to autonomous driving than recognizing the signs, of course, but clearly it is a real-world problem of some importance.

Embedded Inference

For the time being, training is done using a large number of processors in the cloud, typically aided by some sort of specialized engine beyond just the x86 processor in the server. The best-known examples are NVIDIA GPUs, Xilinx or Intel (Altera) FPGAs, and Google's TPU (Tensor Processing Unit).

However, inference, actually using the neural net to accomplish the task (recognizing cats, or understanding spoken English or Chinese), often needs to be done on the device. An obvious case is that a self-driving car cannot depend on the cloud to decide whether a traffic light is red or green, or whether that object is a child or a fire hydrant. It can depend on the cloud for updating its mapping data with roadworks, or for updating the software.

For some additional background from Breakfast Bytes, see my earlier posts:

  • CactusNet: One Network to Rule Them All
  • CactusNet: Moving Neural Nets from the Cloud to Embed Them in Cars
  • Chris Rowen: Neural Networks, the New Moore's Law
  • Machine Learning for Higher Performance Machine Learning

There are two aspects to running neural network inference on an edge device such as a mobile phone, a car, or some IoT device. The first is reducing the size of the network to something that is manageable on a device. Most algorithmic work on these networks is actually done in the cloud using high-performance servers, perhaps with accelerators, and almost all of it in 32-bit floating point; the above posts largely look at this problem. The other is to use a processor subsystem that is appropriate for neural network inference, and this is what Pulin is talking about today.
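On the first point, the standard approach is quantization: train in 32-bit floating point, then map the weights (and activations) to 8- or 16-bit fixed point for deployment. Here is a minimal NumPy sketch of symmetric 8-bit weight quantization, purely as an illustration of the arithmetic (this is not Cadence's tooling):

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights onto int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0      # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # approximate original weights

w = np.random.randn(3, 3, 64).astype(np.float32)   # a conv kernel, say
q, s = quantize_int8(w)
print("max error:", np.abs(w - dequantize(q, s)).max())  # small vs. |w| max
```

Cutting weights from 32-bit floats to 8-bit integers reduces storage and memory bandwidth by 4x, usually with only a small loss of accuracy, which is exactly the trade a memory-constrained edge device wants to make.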

These devices seem to go under the name Neural Network DSPs. It actually reminds me of the early 1980s before the name ASIC was invented. We needed a name for that style of semi-automated design. Custom design didn't really work, since that implied polygon-level layout and circuit simulation. At VLSI, we went with USIC for a time (user-specific IC) but rapidly switched to ASIC once it was clear that name was going to win, despite it being less accurate (the ICs are not application-specific, they are specific to a single customer). I think that this type of edge-device inference processor will get a catchy name and we'll all say "why didn't I think of that?" But for now we're stuck with the distinctly un-catchy NNDSP.

Devices Incorporating Neural Network DSPs

The latest mobile chips seem to be incorporating NNDSPs. One example is the Huawei Kirin 970, the first smartphone SoC chipset with a dedicated NNDSP. Details are starting to emerge as to what "bionic" means in the Apple A11 chip inside the recently announced iPhone 8, 8 Plus, and X. It contains a neural engine (maybe that's the name that will catch on) that sounds like an NNDSP to me.

The challenge for someone designing chips containing an NNDSP is mostly how to get the performance required within the power budget allowed. Another challenge is that this is a field that is advancing in leaps and bounds. Six months is a long time in the deep learning world, yet people have to pick an NNDSP now for a product that will finish design in 2018 and probably largely be shipping inside actual products (handsets, cars, cameras, whatever) from 2019 onwards. Whatever is designed into the hardware needs to be able to keep up with advances in deep learning through software updates, without requiring the SoC to be redesigned.

The Tensilica Vision C5 DSP

The Tensilica Vision C5 DSP is an NNDSP that is low power, scalable, and flexible, with a lot of special features for neural network inference. In particular, it addresses four key challenges in designing an NNDSP:

  • MAC architecture must achieve high MAC utilization, and be flexible about convolution sizes (11x11, 7x7, 3x3, even 1x1)
    • The C5 has specialized dual quad-MAC architecture to get near to 100% MAC utilization for inner loops
    • Optimizations for very small convolution dimensions
    • Enhanced instruction set for multiple vectorization schemes
    • Data compression to avoid multiplications by zero
  • Multiple neural network layer optimization, for high-performance processing of non-convolutional layers, with bit depth flexibility per layer
    • Specific set of ALU operations for enhanced non-convolutional layers
    • Fusion of multiple layers for higher performance
    • Mixed 8- and 16-bit precision
  • Fixed-point quantization, with support for 8- and 16-bit data, and with networks trained in floating point but deployed in fixed point
    • 1,024 8-bit or 512 16-bit MACs
    • On-the-fly precision mixing (such as 8 bit for convolution and 16 bit for normalization)
  • Memory usage, with limited memory bandwidth and efficient access
    • Integrated DMA
    • Ping-pong memory buffering hides memory latency (see the sketch after this list)
    • Rich and wide set of select operations for data manipulation
    • On-the-fly weight decompression to reduce bandwidth needs
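
That ping-pong buffering item is worth a sketch. The idea is to split a layer's data into tiles and alternate between two local buffers: while the DSP computes on one, the DMA engine fills the other. This is the generic double-buffering pattern, not Cadence's actual API; the dma_start/dma_wait/compute callables are hypothetical stand-ins for the hardware primitives:

```python
def run_layer(tiles, dma_start, dma_wait, compute):
    """Ping-pong (double) buffering: overlap DMA transfers with compute.

    dma_start(tile) kicks off an asynchronous copy into a free local buffer
    and returns a handle; dma_wait(handle) blocks until that buffer is ready.
    (Hypothetical primitives, for illustration only.)
    """
    pending = dma_start(tiles[0])                  # prefetch the first tile
    for i in range(len(tiles)):
        buf = dma_wait(pending)                    # tile i now in local memory
        if i + 1 < len(tiles):
            pending = dma_start(tiles[i + 1])      # next copy runs in background
        compute(buf)                               # overlaps the in-flight DMA
```

As long as the compute time per tile exceeds the transfer time, the memory latency disappears entirely behind the computation.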

The result is that the Vision C5 DSP can deliver 1 TMAC/s in less than 1mm² (with 1,024 8-bit MACs per cycle, a clock rate around 1GHz works out to roughly 1 TMAC/s). It is optimized for vision, radar, lidar, fused-sensor, and other applications required for automotive, drone, and mobile/wearable markets.

On top of the base functionality that a single Vision C5 DSP offers, it is also architected to be used in multi-processor configurations. This allows multi-TMAC/s solutions, with shared memory and synchronization between the processors, and with synchronous multi-processor debugging (always a challenge in a multi-processor environment).
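As a sketch of what such scaling can look like from the software side, one obvious scheme is to tile each frame across the cores and collect the results (illustrative only; the core objects and the tiling are my assumption, not Cadence's runtime):

```python
from concurrent.futures import ThreadPoolExecutor

def infer_frame(frame_tiles, cores):
    """Run one inference pass with each core processing one tile of the frame.

    `cores` stands in for a list of Vision C5 instances, each exposing a
    blocking .run(tile) call; results come back in tile order.
    """
    with ThreadPoolExecutor(max_workers=len(cores)) as pool:
        futures = [pool.submit(core.run, tile)
                   for core, tile in zip(cores, frame_tiles)]
        return [f.result() for f in futures]
```

Aggregate throughput then scales roughly with the number of cores, until the shared memory bandwidth becomes the bottleneck.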

Software

Of course, all this would be academic without a working software stack. A design starts from a description of the neural network in one of the standard open-source frameworks such as Caffe and TensorFlow. Cadence then provides both a compiler flow and appropriate run-time libraries, so that the network can be compiled down to run on the Vision C5 DSP.
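To illustrate the starting point, the network description handed to such a flow is just an ordinary framework artifact. A TensorFlow sketch (the file name and the downstream steps are generic placeholders; the actual Cadence flow has its own tools and formats):

```python
import tensorflow as tf

# Any network expressed in a standard framework is the input to the flow;
# here, an off-the-shelf architecture (untrained, just for illustration).
model = tf.keras.applications.MobileNet(weights=None)
model.save("network.keras")   # serialized graph plus weights

# Conceptually, a vendor compiler flow then takes over:
#   network.keras -> quantization -> per-layer kernels -> Vision C5 runtime
```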

For More Information

For more information, see the Tensilica Vision page, or download the Tensilica Vision family product brief.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.