In the machine learning space, two significant things happened recently. The first was the Linley Spring Processor Conference 2021, which focused on processors for machine learning. Linley Gwennap's keynote was titled Driving AI From the Cloud to the Edge. I'll summarize it later in this post, but a lot of what Linley said was similar to his keynote from last year's fall conference, despite that one carrying the very different title Application-Specific Processors Extend Moore's Law. That's because the main types of application-specific processors being developed are for deep learning. I wrote about that keynote in my rather generically titled post Linley Fall Processor Conference 2020.
The second thing that happened was that MLPerf announced the results of its latest benchmark round. If you don't know what MLPerf is, see my post MLPerf: Benchmarking Machine Learning. Basically, it is a set of standardized benchmarks so that machine learning algorithms and hardware can be compared. I'll cover those results in a separate post, although some of them were discussed during the Linley conference.
Linley's keynote was part educational and part a survey of the state of the industry today. He started off by discussing the difference between "AI processors" and "processors". We lack a good word for a processor with the conventional microprocessor architecture that we are all used to, typically used as a general-purpose processor, as the application processor in a smartphone, or even as a simple microcontroller. Most, but not all, AI processors have an array of multiply-accumulate units (MACs), since the fundamental operation in most neural network implementations is the large matrix operation, which consists of lots of MAC operations. How the MACs are connected and how they are interfaced to memory is part of the secret sauce of the different approaches. This classification is somewhat fuzzy, since GPUs are neither of these things, and general-purpose processors have gained specialized instructions (four 8-bit operations at a time) and operands (bfloat16, with a small mantissa and a large exponent).
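To see why MAC arrays are the heart of these chips, here is a minimal sketch in plain Python (no frameworks, made-up numbers): one neural-network layer is a matrix-vector product, and every entry of the result is nothing but a chain of multiply-accumulate operations.

```python
def mac_layer(weights, inputs):
    """Compute one layer's output: out[i] = sum_j weights[i][j] * inputs[j]."""
    outputs = []
    for row in weights:
        acc = 0.0                      # the "accumulate" register
        for w, x in zip(row, inputs):
            acc += w * x               # one MAC: multiply, then accumulate
        outputs.append(acc)
    return outputs

# A 2x3 weight matrix applied to a 3-element input vector
# takes 2*3 = 6 MAC operations.
print(mac_layer([[1, 2, 3], [4, 5, 6]], [1, 1, 1]))  # [6.0, 15.0]
```

An AI processor's differentiation is in how many of these MACs run in parallel and how the weights and inputs are fed to them, not in the arithmetic itself.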
One of the challenges is balancing flexibility against performance. If an AI processor is designed for a specific task, or even a specific algorithm implementing that task, its performance can be optimized. But then if details of the task change, the architecture can become suboptimal. Another tradeoff is latency versus throughput. If you want to do image recognition on millions of images, they can be pipelined through so that the next image starts to be processed before the previous ones have completed. However, if you want to do the superficially similar task of recognizing a single image ("who is this person in front of me?") then there is no second image, and the focus is entirely on how fast that first image can be identified.
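A back-of-the-envelope sketch makes the tradeoff concrete (the pipeline depth and per-stage times here are hypothetical, purely for illustration): a deep pipeline delivers high throughput on a huge batch of images, but the single-image case only ever sees the full pipeline latency.

```python
def pipeline_stats(num_images, stages, stage_time_ms):
    """Idealized pipeline: image #1 traverses all stages; each later image
    completes one stage-time after the previous one."""
    latency = stages * stage_time_ms                    # time for image #1
    total = latency + (num_images - 1) * stage_time_ms  # time for the batch
    throughput = num_images / total * 1000              # images per second
    return latency, throughput

# Big batch: throughput approaches one image per stage-time...
print(pipeline_stats(1_000_000, stages=4, stage_time_ms=1.0))
# ...but a single image still pays the full 4 ms pipeline latency.
print(pipeline_stats(1, stages=4, stage_time_ms=1.0))
```

Doubling the pipeline depth (with correspondingly shorter stages) helps the first number and hurts nobody in steady state, but does nothing for the "one image, answer now" case.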
As a general rule:
Next Linley looked at some specific areas:
Image recognition: Larger models produce better results but have reached the point of diminishing returns. AmoebaNet-B has 557 million weights. Most image recognition processors are benchmarked on ResNet-50 (224x224 pixels), but HD (1920x1080) requires roughly 40X more computation, and 4K roughly 160X more. As Linley said last year, "Resnet-50 is easy, real workloads are hard".
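Those multipliers follow directly from pixel counts, since the compute for convolutional networks scales roughly with the number of pixels processed:

```python
# Compute scales roughly with pixel count, so compare frame sizes
# against the 224x224 crops that ResNet-50 benchmarks use.
resnet_pixels = 224 * 224      # 50,176
hd_pixels     = 1920 * 1080    # 2,073,600
uhd_pixels    = 3840 * 2160    # 8,294,400 (4K UHD)

print(round(hd_pixels / resnet_pixels))   # ~41, the "40X" figure
print(round(uhd_pixels / resnet_pixels))  # ~165, roughly "160X"
```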
Language: Models are growing about 40X per year, with the largest, Google's Switch Transformer, now at 1.6 trillion weights. GPT-3 (175B weights) can create article summaries that nearly match a human's. Google's T5 (11B weights) can translate general text from English to German.
Training models this size requires huge amounts of computation. Model "sharding" divides a model across many chips, in the case of the largest models, thousands of chips. There are lots of issues in what hardware and networking is required to make this work. Often, the training is done in a cloud data center, with instances that include a GPU or an FPGA. Sometimes, these are specialized designs like Google's TPUs.
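Here is a toy sketch of the sharding idea (lists standing in for chips; names and sizes are invented): split one weight matrix row-wise across several devices, run each shard independently, and concatenate the partial outputs. Real systems add the interconnect, synchronization, and gradient exchange that make this hard; this only shows the partitioning.

```python
def shard_rows(weights, num_chips):
    """Divide the rows of a weight matrix across num_chips devices."""
    per_chip = (len(weights) + num_chips - 1) // num_chips
    return [weights[i:i + per_chip] for i in range(0, len(weights), per_chip)]

def sharded_forward(shards, inputs):
    """Each shard computes its slice of the output (in reality, in parallel)."""
    outputs = []
    for shard in shards:
        for row in shard:
            outputs.append(sum(w * x for w, x in zip(row, inputs)))
    return outputs

weights = [[1, 0], [0, 1], [1, 1], [2, 2]]   # 4 output neurons
shards = shard_rows(weights, num_chips=2)    # 2 rows per "chip"
print(sharded_forward(shards, [3, 4]))       # [3, 4, 7, 14]
```

Note that every chip needs the full input vector, which hints at why the networking between thousands of chips becomes the real engineering problem.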
Today, in the data center, NVIDIA is the clear winner with revenues up nearly 70% in 2020 (not including Mellanox revenue). The state of the art is the new Ampere A100:
Linley listed the training challengers and said that they "fall short of NVIDIA's Ampere".
For inference, it is more about efficiency than peak performance. Linley name-checked some of the challengers there, most of which (I'm sure coincidentally!) were presenting during the rest of the conference:
There are other approaches, such as using resistive or phase-change memory to store the weights as analog values. This is not very accurate, but the power and area savings are so great that adding a lot more weights to compensate is not a problem. Several companies, including IBM, are working on this, but no products have been released. Another approach is the ultimate in reduced accuracy, going to 1-bit neural networks. Again, accuracy obviously suffers, but the cost of adding a lot more weights is minimal. There is also work going on to use photonics to build optical neural nets.
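To illustrate why 1-bit weights are so cheap, here is a sketch of the standard trick used in binarized networks (the encoding here is one common choice, not any particular product's): with weights and activations constrained to ±1 and packed as bits, an entire MAC chain collapses into an XNOR plus a popcount.

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two n-element vectors of +/-1 values, each packed
    into an integer where bit 1 encodes +1 and bit 0 encodes -1."""
    matches = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # XNOR: 1 where signs agree
    pop = bin(matches).count("1")                  # popcount of agreements
    return 2 * pop - n                             # agreements minus disagreements

# Two 4-element vectors that agree in 2 of 4 positions: dot product = 2 - 2 = 0
print(binary_dot(0b1011, 0b1101, 4))  # 0
```

One XNOR gate plus a counter replaces a multiplier and an adder, which is the hardware saving that pays for all the extra weights.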
Several years ago, the Linley Spring Processor Conference was all about mobile. Lots of companies were building chips for mobile. It reminded me of the 1980s at VLSI Technology when we were working with what seemed like a couple of dozen companies designing chips for PCs, all with the strategy to be 20% of the PC industry. In both cases, almost none of them got anywhere much since the key players were also working with us to design their own chips. Then we entered the market with standard products too. For a time we were very successful, Intel even OEMed our chipset. But it was clear that we would be crushed once Intel designed its own chipsets, which it did. Similarly, all the market leaders in mobile design their own application processors, and there turned out not to be a merchant market for processors, just for solutions from companies like Qualcomm and Mediatek. In AI training, the same thing looks like it is happening:
Of course, that raises the big question of who is going to buy all these other processors when the volume purchasers are all rolling their own. The edge has a broader range of potential customers. The high-end smartphone vendors all have their own AI processors, but there are enormous numbers of other devices from smart doorbells, to smart speakers (Alexa, etc). While Amazon could obviously design a chip for Alexa if it made commercial sense, most companies selling products like this are not capable of designing their own AI chips even if they wanted to, and so they are potential markets for inference chips. But with a broader range of potential customers comes a lot of fragmentation—any individual customer is comparatively small.
That's my conclusion. Here's Linley's from his last slide:
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.