Paul McLellan

Linley Processor Conference 2020 Keynote

1 May 2020 • 6 minute read

The Linley Processor Conference always opens with a keynote by Linley Gwennap giving an overview of processors in whatever is the hottest area. Most of the other presentations during the conference tend to be in that hottest area too, so the keynote sets a context into which the other presentations slot. In the distant past, the focus was on data center processors. Then the most important processors in the world were all mobile. But that market largely went away once it became clear that the mobile players at the high end were going to build their own application processors, and Qualcomm and MediaTek had most of the merchant market. Now, as someone said as an aside in a question, "this is the AI processor conference".

Indeed. Linley's keynote was titled The Next Generation of AI Processors.

Of course, like everyone else's, Linley's presentation came from...his living room. He started off with the obvious trend that models are getting bigger, since larger models usually produce more accurate results. Image-processing models are growing around 2X per year, and language-processing models around 10X per year. This is definitely true in the academic world, where all that counts is accuracy and cost is ignored completely. But in the real world, where the focus is on cost-benefit and not just benefit, I'm not sure how true that is. A company might be able to make its natural-language recognition 0.1% better with a model 100 times as large, but that is probably not the tradeoff it would make.
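To give a feel for what those growth rates compound to, here is a minimal sketch. The starting sizes are my own illustrative reference points (roughly a ResNet-50-class image model at ~25M parameters and a BERT-large-class language model at ~340M parameters), not figures from the keynote.

```python
# Illustrative compounding of the quoted growth rates.
# Starting sizes are my own reference points, not keynote data.
image_params = 25e6      # ~ResNet-50-class image model
language_params = 340e6  # ~BERT-large-class language model

for year in range(1, 4):
    image_params *= 2      # ~2X per year for image models
    language_params *= 10  # ~10X per year for language models
    print(f"year {year}: image ~{image_params / 1e6:.0f}M params, "
          f"language ~{language_params / 1e9:.2f}B params")
```

Three years of 10X-per-year growth is a factor of 1,000, which gives a sense of why the cost side of that curve cannot be ignored outside academia.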

One challenge in the deep-learning world is how to assess the "goodness" of a solution. There is the MLPerf organization, which orchestrates standard benchmarks. But usually people take ResNet-50 or AlexNet (image-recognition networks) and rate how many images per second a chip can process (pretty much any of these AI processors can do more than 10,000 images per second on ResNet-50). This is better than just quoting a measure like "TOPS", since a lot depends on the width of the MACs: it is a lot easier to deliver a TOPS of 8-bit operations than of Bfloat16, which in turn is easier than FP32.
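To make the TOPS-versus-throughput point concrete, here is a back-of-the-envelope sketch. The ~3.9 GFLOPs per 224x224 ResNet-50 inference (counting multiplies and adds separately) is the commonly quoted operation count, and the 30% utilization factor is purely my assumption for illustration; real utilization varies widely, which is exactly why images per second is the more honest metric.

```python
# Back-of-the-envelope: relating peak TOPS to ResNet-50 throughput.
# ~3.9e9 ops (~2e9 MACs) per ResNet-50 inference is the commonly quoted
# figure; the utilization factor is an assumption for illustration only.
RESNET50_OPS_PER_IMAGE = 3.9e9

def images_per_second(peak_tops: float, utilization: float = 0.3) -> float:
    """Sustained images/s at a given fraction of the chip's peak op rate."""
    sustained_ops = peak_tops * 1e12 * utilization
    return sustained_ops / RESNET50_OPS_PER_IMAGE

for tops in (10, 100, 400):
    print(f"{tops:4d} TOPS @ 30% utilization -> "
          f"~{images_per_second(tops):,.0f} images/s")
```

On these assumptions, the 10,000-images-per-second figure above corresponds to roughly 130 peak TOPS, and whether those are 8-bit, Bfloat16, or FP32 operations changes the silicon cost enormously.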

Linley sees a lot more processors than I do (probably more than anyone does -- it's his day job), so his perspective that a wide range of core sizes is feasible is important. Small cores are easier to replicate but complicate both the interconnect and the software compilation needed to get the model mapped onto the system. The designs range all the way from Cerebras at one extreme (who presented at the conference, and who I wrote about last year in HOT CHIPS: The Biggest Chip in the World) with over 400,000 tiny cores, to Groq (who also presented) with one huge core. Everyone else is in between, mostly with under a hundred cores, and a few closer to a thousand.

Most neural networks are based on MACs (multiply-accumulate units), typically doing the multiply at one precision and the accumulate (add) at a higher precision so that the running sum neither overflows (fixed-point) nor loses precision (floating-point). But that is not the only way to implement a neural network. You can also use the approaches below (a small numeric sketch of the mixed-precision MAC and the XNOR trick follows the list):

  • In-memory compute, in particular using the resistors in resistive or phase-change memory as analog values, not digital. This can reduce the power by 20X or more, but the challenge is that the resistors are not really controlled enough to hold, say, 256 values accurately. The analog variability is also a manufacturing challenge. But several companies are pursuing this approach, at least in research.
  • Binary neural networks (BNN). This takes quantization to the limit, with weights that are -1 or +1. "Multiplication" is then just an XNOR (very low power). The lack of precision can be compensated for by a large increase in the number of layers. Currently, a BNN still can't achieve the same accuracy (on AlexNet) but can get close with a lot less power and memory.
  • Spiking neural networks (SNN). These are more similar to how the brain behaves, with binary activation (spike or no spike) and just counters and adders (no MACs). They don't scale well to big problems, but they can handle simple tasks efficiently.
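As promised above, here is a minimal NumPy sketch of the two arithmetic ideas: an int8 multiply with int32 accumulation, and the XNOR-plus-popcount dot product that replaces multiplication in a BNN. This is my own illustration of the arithmetic, not any vendor's implementation.

```python
import numpy as np

# Mixed-precision MAC: multiply narrow int8 values but accumulate in int32
# so the running sum cannot overflow the narrow multiplier width.
rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=1024).astype(np.int8)
activations = rng.integers(-128, 128, size=1024).astype(np.int8)
acc = int(np.sum(weights.astype(np.int32) * activations.astype(np.int32)))

# BNN "multiplication": with weights/activations restricted to {-1, +1} and
# encoded as bits (0 -> -1, 1 -> +1), a dot product becomes XNOR + popcount.
w_bits = rng.integers(0, 2, size=1024).astype(np.uint8)
a_bits = rng.integers(0, 2, size=1024).astype(np.uint8)
agree = 1 - (w_bits ^ a_bits)                  # XNOR: 1 where the signs match
bnn_dot = 2 * int(agree.sum()) - len(w_bits)   # (#agree) - (#disagree)

print("int8-multiply / int32-accumulate:", acc)
print("XNOR-popcount BNN dot product:   ", bnn_dot)
```

The XNOR-and-popcount form is why BNN "multipliers" are so cheap in hardware: no multiplier array at all, just gates and a counter.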

Suppliers to the Data Center

 Linley moved on to look at the suppliers to the data center market. NVIDIA is the king of the hill, with the 2017 Volta V100 the best-selling GPU for the purpose, although the 2018 Turing T4 has overtaken the V100 in units (but not revenue). The next-generation Ampere product remains unannounced, although it is expected to deliver a big boost for AI.

But competition is coming in the training (data center) area:

  • Cerebras (which I already mentioned, with its wafer-scale chip)
  • Habana, developing the Gaudi accelerator for training. Habana was acquired by Intel, which also dropped its Nervana "Spring Crest" training chip following the acquisition.
  • Huawei Ascend Max
  • Graphcore with 1,000 FP16 cores

On the inference side, as Linley put it, "vendors come and go":

  • Groq posts the top ResNet-50 score among merchant chips
  • Habana (now Intel) offers Goya accelerator for inference
  • Xilinx offers Alveo accelerator (along with FPGA, of course)
  • SambaNova showed a chip but no details
  • Intel ended "Spring Hill" development after the Habana acquisition
  • Wave Computing shut down

As for the cloud vendors themselves, they have internal projects, too (which always raises the question of whom the merchant semiconductor companies will sell to):

  • Alibaba deploys its Hanguang ASIC for search and recommendations (industry-leading ResNet-50 score of 78,600 images per second)
  • Google uses three generations of its TPU for training and inference
  • Microsoft deploys FPGA-based Brainwave for limited inference
  • Amazon (Annapurna) offers Inferentia ASIC for cloud rental
  • Baidu deploys servers using Kunlun ASIC accelerator

Automotive

Most new cars in the US, EU, and similar markets have level 2 ADAS features such as AEB (automatic emergency braking), LKA (lane-keep assist), and ACC (adaptive cruise control). Most automakers plan to skip from level 2 straight to level 4 due to the driver-handoff problems at level 3. Initial models will be geofenced and limited to daylight and good weather. Many now consider level 5 (fully autonomous, no controls) to be unattainable. The road to autonomy is taking longer than expected, and the initial focus is on commercial vehicles, where the economics are clearer.

Moving on to data centers on wheels, also known as automotive: the battle in the merchant space is between Intel (Mobileye) and NVIDIA for now, with Mobileye having the lowest power and lowest cost, but NVIDIA having the highest performance. In terms of general progress, GM customers have driven over 5M miles in level 3 "Super Cruise" mode. Tesla plans to deliver "full self-driving" this year as an OTA upgrade. Waymo has performed thousands of driverless rides in Phoenix, some with and some without safety drivers. Intel Mobileye remains committed to a robotaxi service in Jerusalem by 2022. GM commits to producing the Cruise Origin, a level 4 robotaxi with no steering wheel or controls, by 2022. But Linley is more cautious about timing and says "we expect many luxury brands to offer some level 4 consumer vehicles by 2025".

Mobile

AI accelerators are now standard in mid-premium and premium smartphones:

  • Qualcomm Snapdragon 765 adds tensor accelerator to Hexagon DSP
  • Samsung Exynos 980 uses custom neural engine
  • Huawei Kirin 810 features in-house DaVinci neural engine
  • MediaTek P90/G90 employs a Cadence Tensilica P6 DSP plus a custom neural engine
  • Unisoc (Spreadtrum) T710 features 6-TOPS AI engine

But to me, it remains a little unclear exactly what they are doing (beyond face recognition to unlock, etc).

Ultra-Low-Power AI

  • Most MCUs need an accelerator to run vision applications (for example, simple vision tasks such as owner recognition can require 2,000 MMAC/s; see the rough sizing sketch after this list)
  • The TinyML foundation is targeting small sensors. Their summit in February featured 20 presentations and panels, so there is interest.
  • Google has a beta version of TensorFlow Lite for MCUs (TFLM, which I wrote about recently in HiFi DSPs - Not Just for Music Anymore)
  • Eta Compute offers Tensai MCU with CoolFlux DSP at 200 MMAC/s (basic voice recognition at just 2mW)
  • GreenWaves GAP9 MCU implements 9-core accelerator using RISC-V 
  • BrainChip Akida employs spiking neural network to reduce power (in 16nm can run MobileNet at 30fps using just 79mW)
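To put those MMAC/s figures in perspective, here is a rough sizing sketch. The per-inference MAC counts are my own assumptions, chosen only to land near the rates quoted above (roughly 20M MACs for a small keyword-spotting network and roughly 200M MACs for a small owner-recognition network); they are not numbers from the keynote.

```python
# Rough sizing: sustained MMAC/s = MACs per inference x inferences per second.
# The per-inference MAC counts are illustrative assumptions, not keynote data.
def mmacs_per_second(macs_per_inference: float, rate_hz: float) -> float:
    return macs_per_inference * rate_hz / 1e6

# A small keyword-spotting network, ~20M MACs, evaluated 10 times a second:
print(f"voice:  ~{mmacs_per_second(20e6, 10):,.0f} MMAC/s")

# A small owner-recognition CNN, ~200M MACs per frame, at 10 frames/s:
print(f"vision: ~{mmacs_per_second(200e6, 10):,.0f} MMAC/s")
```

Under those assumptions, always-on voice lands around 200 MMAC/s and simple always-on vision around 2,000 MMAC/s, which is why a bare MCU needs an accelerator for the latter but a small DSP can handle the former.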

Summary


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.