Paul McLellan
Linley: Driving AI from the Cloud to the Edge

12 May 2021 • 7 minute read

In the machine learning space, two significant things happened recently. The first was the Linley Spring Processor Conference 2021, whose focus was processors for machine learning. Linley Gwennap's keynote was titled Driving AI From the Cloud to the Edge. I'll summarize it later in this post, but a lot of what Linley said was similar to his keynote from last year's fall conference, despite that one being titled very differently as Application-Specific Processors Extend Moore's Law. That's because the main types of application-specific processors being developed are for deep learning. I wrote about that keynote in my rather generically titled post Linley Fall Processor Conference 2020.

The second thing that happened was that MLPerf announced the results of its latest round of benchmarks. If you don't know what MLPerf is, see my post MLPerf: Benchmarking Machine Learning. Basically, it is a set of standardized benchmarks so that machine learning algorithms and hardware can be compared. I'll cover the results in a separate post, although some of them were discussed during the Linley conference.

Linley Conference

Linley's keynote was part educational and part a survey of the state of the industry today. He started off by discussing the difference between "AI processors" and "processors". We lack a word for a processor of the familiar microprocessor architecture, typically used as a general-purpose processor, as the application processor in a smartphone, or even as a simple microcontroller. Most, but not all, AI processors have an array of multiply-accumulate units (MACs), since the fundamental operations in most neural network implementations are large matrix operations, which involve enormous numbers of MAC operations. How the MACs are connected and how they are interfaced to memory is part of the secret sauce of the different approaches. This classification is somewhat fuzzy since GPUs are neither of these things, and specialized instructions (four 8-bit operations at a time) and operands (bfloat16, with a small mantissa and a large exponent) have been added to general-purpose processors.
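To make the MAC point concrete, here is a minimal sketch (plain Python, purely illustrative and not tied to any particular chip) of a dense layer computed as chains of multiply-accumulates; an AI processor's MAC array runs many of these chains in parallel:

def dense_layer(weights, inputs):
    # weights: list of rows (one row per output), inputs: list of floats
    outputs = []
    for row in weights:
        acc = 0.0
        for w, x in zip(row, inputs):
            acc += w * x  # one multiply-accumulate (MAC) per weight
        outputs.append(acc)
    return outputs

# A 3-input, 2-output layer performs 3 x 2 = 6 MACs.
print(dense_layer([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [1.0, 2.0, 3.0]))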

One of the challenges is to balance flexibility and performance. If an AI processor is designed for a specific task, or even for a specific algorithm that implements that task, the performance can be optimized. But then if details of the task change, the architecture can become suboptimal. Another tradeoff is between latency and throughput. If you want to do image recognition on millions of images, they can be pipelined through so that the next image starts to be processed before the previous ones have completed. However, if you want to do the superficially similar task of recognizing a single image ("who is this person in front of me?"), then there is no second image, and the focus is entirely on how fast that one image can be identified.
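As a rough illustration of that tradeoff (the numbers below are made up for the example, not taken from the keynote), a pipelined accelerator amortizes its latency over a large batch but cannot hide it for a single image:

# Hypothetical 4-stage pipeline, 5 ms per stage.
STAGES = 4
STAGE_MS = 5

def single_image_latency_ms():
    # The lone image must pass through every stage before it completes.
    return STAGES * STAGE_MS

def batch_completion_ms(num_images):
    # Fill the pipeline once, then one image completes every stage time.
    return STAGES * STAGE_MS + (num_images - 1) * STAGE_MS

print(single_image_latency_ms())                   # 20 ms for the single image
print(batch_completion_ms(1_000_000) / 1_000_000)  # ~5 ms per image, amortized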

As a general rule:

  • convolutional units (used for vision for example) are the least flexible but optimized for what they do
  • systolic arrays of MACs and general matrix multiplication units (GEMMs) are also specialized
  • smaller cores are more flexible
  • coarse-grained reconfigurable architectures (CGRAs) are more flexible still
  • CPU/GPU/DSP with SIMD is the most flexible

Next Linley looked at some specific areas:

Image recognition: Larger models produce better results but have reached the point of diminishing returns. AmoebaNet-B has 557 million weights. Most image-recognition processors are benchmarked on ResNet-50 (224x224 pixels), but HD (1920x1080) requires about 40X more computation, and 4K requires about 160X more. As Linley said last year, "ResNet-50 is easy, real workloads are hard".
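Those multipliers fall out of the pixel counts alone, assuming (roughly) that the convolution work scales linearly with resolution:

resnet_pixels = 224 * 224       # typical ResNet-50 input
hd_pixels = 1920 * 1080
uhd_pixels = 3840 * 2160        # "4K"

print(hd_pixels / resnet_pixels)    # ~41X more work than the ResNet-50 input
print(uhd_pixels / resnet_pixels)   # ~165X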

Language: Models are growing about 40X per year, with the largest, Google's Switch Transformer, now at about 1.6 trillion weights. GPT-3 (175B weights) can create article summaries that nearly match a human's. Google's T5 (11B weights) can translate general text from English to German.

Training models of this size requires huge amounts of computation. Model "sharding" divides a model across many chips; for the largest models, thousands of chips. There are lots of issues in what hardware and networking are required to make this work. Often, the training is done in a cloud data center, with instances that include a GPU or an FPGA. Sometimes, these are specialized designs like Google's TPUs.
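As a toy illustration of what sharding means (a sketch only, not any particular vendor's scheme), here one layer's weight matrix is split column-wise across hypothetical "devices", each of which computes a slice of the output:

import numpy as np

def shard_columns(weights, num_devices):
    # Split the weight matrix into column blocks, one per device.
    return np.array_split(weights, num_devices, axis=1)

def sharded_matmul(x, shards):
    # In a real system each partial product runs on a different chip and the
    # slices are gathered over the interconnect; here it is just a loop.
    partials = [x @ w for w in shards]
    return np.concatenate(partials, axis=-1)

x = np.random.randn(4, 1024)        # a small batch of activations
w = np.random.randn(1024, 4096)     # one layer's weights
shards = shard_columns(w, num_devices=8)
assert np.allclose(sharded_matmul(x, shards), x @ w)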

 Today, in the data center, NVIDIA is the clear winner with revenues up nearly 70% in 2020 (not including Mellanox revenue). The state of the art is the new Ampere A100:

  • 624 INT8 TOPS (2.4X Turing) or 1,250 TOPS with structured sparsity (see the sketch after this list)
  • 310 FP16 Tflop/s (2.5X Volta) for training
  • 40MB of cache (7X Turing/Volta)
  • But power increases from 300W to 400W TDP
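The structured sparsity referred to above is NVIDIA's 2:4 scheme: in every group of four weights, only the two largest-magnitude values are kept, and the hardware skips the zeros, roughly doubling effective throughput. A rough NumPy sketch of the pruning step (illustrative only, not NVIDIA's actual tooling):

import numpy as np

def prune_2_of_4(weights):
    # Zero out the two smallest-magnitude weights in each group of four.
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]   # indices of the 2 smallest
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(8)
print(prune_2_of_4(w))    # exactly half the entries are now zero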

Linley listed the training challengers and said that they "fall short of NVIDIA's Ampere".

  • Cerebras offers its huge wafer-scale engine (WSE) with 400,000 cores and 18GB of SRAM but has not disclosed any benchmarks
  • Graphcore offers the GC2 chip with 1,216 cores and FP16 capability (now in production; training speed similar to V100 on BERT)
  • Habana (Intel) developing Gaudi accelerator for training (initial benchmarks similar to V100 but production delayed)
  • SambaNova has chip and system but has not disclosed any benchmarks
  • Tenstorrent developing Wormhole accelerator for training

For inference, it is more about efficiency than peak performance. Linley name-checked some of the challengers there, most of which (I'm sure coincidentally!) were presenting during the rest of the conference:

  • NVIDIA A100 posts top ResNet-50 score but burns 400W
  • Qualcomm offers 4.5X better performance per watt than A100 (just 75W)
  • Tenstorrent also offers excellent performance per watt
  • Groq’s single-core design delivers the industry’s best latency
  • Habana (Intel) offers the Goya accelerator for inference, which beats NVIDIA T4 on CNNs and BERT
  • SimpleMachines excels on BERT efficiency

There are other approaches, such as using resistive or phase-change memory to store the weights as analog values. This is not very accurate, but the power and area savings are so great that it is not a problem to add a lot more weights to compensate. Several companies, including IBM, are working on this, but no products have been released. Another approach is the ultimate in reduced precision: 1-bit neural networks. Again, each individual weight carries very little accuracy, but the cost of adding a lot more weights is minimal. There is also work going on to use photonics to build optical neural nets.
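As a toy illustration of the 1-bit idea (a sketch only, not any specific product): weights and activations collapse to +1/-1, so the multiply-accumulate reduces to logic that is extremely cheap in hardware (XNOR plus a population count on packed bits):

def binarize(values):
    # Keep only the sign of each value: +1 or -1 (one bit per number).
    return [1 if v >= 0 else -1 for v in values]

def binary_dot(weights, inputs):
    w, x = binarize(weights), binarize(inputs)
    # Equivalent to XNOR + popcount when the +1/-1 values are packed as bits.
    return sum(wi * xi for wi, xi in zip(w, x))

print(binary_dot([0.3, -1.2, 0.7, -0.1], [0.9, -0.4, -0.5, 0.2]))   # prints 0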

Several years ago, the Linley Spring Processor Conference was all about mobile. Lots of companies were building chips for mobile. It reminded me of the 1980s at VLSI Technology, when we were working with what seemed like a couple of dozen companies designing chips for PCs, all with the strategy of capturing 20% of the PC industry. In both cases, almost none of them got anywhere much since the key players were also working with us to design their own chips. Then we entered the market with standard products too. For a time we were very successful; Intel even OEMed our chipset. But it was clear that we would be crushed once Intel designed its own chipsets, which it did. Similarly, all the market leaders in mobile design their own application processors, and there turned out not to be a merchant market for processors, just for solutions from companies like Qualcomm and MediaTek. In AI training, the same thing looks like it is happening:

  • Alibaba deploys Hanguang ASIC for search and recommendations (industry-leading (by far) ResNet-50 score of 78,600 IPS)
  • Google uses four TPU generations for training and inference (TPUv4 performs similarly to NVIDIA A100 on MLPerf training benchmarks)
  • Amazon AWS offers the Inferentia ASIC for cloud rental (lower performance and power efficiency than NVIDIA T4, but in use for some Alexa voice processing); a new chip, Trainium, is coming later in the year for training
  • Baidu deploys servers using Kunlun ASIC accelerator (similar performance to 2X NVIDIA T4 at same 150W power)

Of course, that raises the big question of who is going to buy all these other processors when the volume purchasers are all rolling their own. The edge has a broader range of potential customers. The high-end smartphone vendors all have their own AI processors, but there are enormous numbers of other devices, from smart doorbells to smart speakers (Alexa, etc.). While Amazon could obviously design a chip for Alexa if it made commercial sense, most companies selling products like this are not capable of designing their own AI chips even if they wanted to, and so they are a potential market for inference chips. But with a broader range of potential customers comes a lot of fragmentation: any individual customer is comparatively small.

That's my conclusion. Here's Linley's from his last slide:

  • Ever-larger neural networks require greater chip performance, scalability
  • Smaller data types, sparsity create opportunity for “free” improvements in performance, but they require changes to software and hardware designs
  • Analog and optical AI promise bigger gains and are nearing production
  • Data center competition is growing from both chip vendors and cloud vendors, but NVIDIA still leads in performance (but not efficiency)
  • Many new companies are jumping into embedded AI processors, where barriers to entry are low and volumes are starting to ramp
  • Small accelerators can extend battery life, capabilities in low-cost sensors

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.