Paul McLellan

Linley Gwennap's Deep Dive into Deep Learning

1 May 2019 • 4 minute read

At the recent Linley Spring Microprocessor Conference, Linley Gwennap kicked off with the opening keynote on what is clearly the biggest thing to hit processors in a long time: deep learning.

Linley started with an overview of deep learning and the latest trends. I'm going to skip that since I've covered the basics in many posts already. Start here if you need to get up to speed:

  • HOT CHIPS Tutorial: On-Device Inference
  • Bagels and Brains: SEMI's Artificial Intelligence Breakfast
  • Deep Learning and the Cloud
  • HOT CHIPS: Some HOT Deep Learning Processors
  • Inside Google's TPU

Even if you don't have time to read these, a quick summary would be that:

  • For training, it is mostly 32-bit floating point FP32 (with some FP16), but there is also an up-and-coming bfloat16 format that keeps the 8-bit exponent range of FP32 with just 7 bits of mantissa, so less precision but greater range than regular FP16
  • For inference, especially on the edge, it is mostly 8-bit multiplies with 32-bit accumulates, using integer arithmetic rather than floating point (see the sketch after this list)

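To make those formats concrete, here is a minimal Python sketch (my own illustration, not from the talk; the function names are mine) of how bfloat16 relates to FP32 and what an 8-bit multiply with 32-bit accumulate looks like:

import numpy as np

def fp32_to_bfloat16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16 by keeping its top 16 bits.
    bfloat16 reuses FP32's 8-bit exponent (same range) but keeps only
    7 mantissa bits (less precision); real hardware rounds, this simple
    truncation just shows the bit layout."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return int(bits) >> 16

def int8_dot(a, b) -> int:
    """8-bit multiplies with a 32-bit accumulate, the usual recipe for
    edge inference: each int8 x int8 product fits in 16 bits, and the
    wide accumulator keeps long sums from overflowing."""
    acc = np.int32(0)
    for x, y in zip(np.asarray(a, dtype=np.int8), np.asarray(b, dtype=np.int8)):
        acc += np.int32(x) * np.int32(y)
    return int(acc)

# 1.0 in FP32 is 0x3F800000, so its bfloat16 truncation is 0x3F80
assert fp32_to_bfloat16_bits(1.0) == 0x3F80
print(int8_dot([100, -50, 127], [120, 90, 127]))  # 23629
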
Deep Learning in the Datacenter

You probably already know that NVIDIA and Intel dominate in the datacenter. Intel has added AI-boost instructions to the new Cascade Lake that triple its performance (at least for things like 8-bit arithmetic). Intel presented that processor at the conference, and I'll cover it in another post. Or, if you want a preview, Intel gave some details about it at HOT CHIPS last summer, which I covered in Intel's Cascade Lake: Deep Learning, Spectre/Meltdown, Storage Class Memory.

Today, the most popular training option is the NVIDIA V100 Volta. This is a GPU with tensor cores delivering 125 TFLOPS (FP16) at 300W. The NVIDIA T4 is a PCIe accelerator for AI inference with integer-optimized cores, delivering an estimated 80 TOPS of 8-bit arithmetic at 70W.
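As a rough sanity check (my own back-of-the-envelope arithmetic, not figures from the talk; variable names are mine), here is what those numbers imply in operations per watt:

v100_tflops_fp16, v100_watts = 125, 300   # NVIDIA V100, FP16 with tensor cores
t4_tops_int8, t4_watts = 80, 70           # NVIDIA T4, estimated 8-bit TOPS

print(f"V100: {v100_tflops_fp16 / v100_watts:.2f} TFLOPS per watt (FP16)")
print(f"T4:   {t4_tops_int8 / t4_watts:.2f} TOPS per watt (INT8)")
# Roughly 0.42 vs 1.14: the integer-optimized inference card delivers about
# 3x the operations per watt, although an FP16 multiply-add and an 8-bit
# multiply with 32-bit accumulate are not directly comparable operations.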

But new challengers are emerging. 

  • Habana is in production with its Goya accelerator for inference, with leadership performance on ResNet-50
  • Graphcore offers the GC2 chip with 1,216 FP16-capable cores
  • Xilinx offers the Alveo accelerator, a preprogrammed FPGA that beats the NVIDIA T4 for inference at small batch sizes
  • Huawei's Ascend MAX is due to ship in mid-2019
  • Wave has developed a dataflow processor with 16,384 cores, available either as systems or as licensable IP

In addition, cloud providers have their own:

  • Google deployed TPUv1 (inference only), and then v2 and v3 (training and inference)
  • Microsoft uses FPGA-based Brainwave for some inference
  • Annapurna (acquired by Amazon) designed the Inferentia ASIC for deployment in AWS in 2H19
  • Alibaba and Baidu have developed FPGA-based accelerators

Inference at the Edge

The general trend is to move AI to the edge. Today, products like Alexa and Siri send a compressed voice recording up to the cloud for voice recognition and natural language processing. But, especially in aggregate, there is a huge amount of processing power at the edge, and it makes sense to use that rather than adding more cloud capacity. It also reduces latency, works offline, and is better from a privacy point of view.

High-end smartphones have AI capability:

  • Apple A11 and A12 processors integrate a neural engine
  • Samsung Exynos 9810, in the Galaxy S9, uses the neural engine from DeePhi (Xilinx)
  • Huawei Kirin 970 and 980 feature a neural engine from Cambricon
  • Qualcomm Snapdragon 845 and 855 include Hexagon vector DSP
  • Mediatek P90 uses Tensilica Vision P6 with a custom neural engine

This AI inference is creeping down into both mid-range phones and also IoT devices such as voice assistants, smart security cameras, and drones. In particular:

  • Intel Myriad X has vision-processing hardware and direct camera interfaces, offering 1 TOPS at 2.5W (with SPARC under the hood, at least for now)
  • Bitmain BM1880, with Arm under the hood, also offers 1 TOPS at 2.5W
  • Chinese chipmakers are developing chips for this market (since most of the cameras come from China): Canaan/Kendryte, Cambricon, HiSilicon, Horizon Robotics
  • Arm Cortex-M4 has DSP extensions to improve performance
  • Eta Compute offers the Tensai MCU with a CoolFlux DSP accelerator
  • Greenwaves GAP8 MCU implements an 8-core RISC-V accelerator

Automotive

One particular type of "edge device" is the automobile. (I won't cover automotive trends here; I have covered them repeatedly in the last few years. If you need a kick-start, begin with Automotive Summit: The Road to an Autonomous Future.) The current competitors in the merchant market are NVIDIA and Intel.

  • Intel's Mobileye subsidiary is the market leader for level 1/2 ADAS. It offers the lowest power at the lowest cost with $700M in 2018 revenue
    • EyeQ4 (production) offers 2 TOPS at 6W
    • EyeQ5 (2020) offers 12 TOPS at 5W
  • NVIDIA offers powerful processors targeted at level 4/5 and is the leader in autonomous vehicle design wins
    • Xavier (production) offers 30 TOPS at 30W
    • Pegasus Drive can combine one or two Xavier chips and one or two Volta GPUs

It was too late to get a mention at the conference, but Tesla recently announced their FSD processor SoC. See my recent post Tesla Drives into Chip Design.

Conclusions

  • AI is moving from adapted architectures such as GPUs and DSPs to special-purpose custom-designed architectures with MAC arrays
  • Inference is moving from the data center to the edge
  • Smartphone vendors see AI performance as a differentiator (face recognition, etc)
  • Automotive is a big processor opportunity
  • IoT is the highest volume area for AI-enhanced processors

If you have been reading Breakfast Bytes regularly, you already knew much of that. But this post contains a lot of detail about who the players are and what their current (and, in some cases, future) offerings are.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.