Linley Keynote Fall 2022

7 Nov 2022

Last week was the Linley Fall Processor Conference. That still seems to be the name even though the Linley Group was acquired by TechInsights last year. As always, the conference opened with Linley Gwennap's keynote. As has been the case for several years now, the focus was on what was going on in processors for AI training and inference.

He started by looking at imaging. One issue with imaging is that everyone tends to use ImageNet, where the images are 224 pixels square. In comparison, HD (1920 by 1080) requires 40 times as much computation and memory, and some automotive applications are moving to 4K, which requires even more. Large models have diminishing returns, with VIT-4K requiring 100 billion weights.
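As a rough check on the "40 times" figure, here is a minimal sketch that compares pixel counts. It assumes computation and memory scale linearly with the number of pixels, and that "4K" means 3840 by 2160; neither assumption is spelled out in the keynote, and the function name is just for illustration.

```python
# Resolutions as (width, height). ImageNet and HD are the sizes quoted in the text.
IMAGENET = (224, 224)
HD = (1920, 1080)
UHD_4K = (3840, 2160)  # assumption: "4K" means 3840x2160 (UHD)


def pixel_ratio(resolution, baseline=IMAGENET):
    """Return how many times more pixels `resolution` has than `baseline`."""
    width, height = resolution
    base_width, base_height = baseline
    return (width * height) / (base_width * base_height)


print(f"HD vs ImageNet: {pixel_ratio(HD):.1f}x")      # 41.3x
print(f"4K vs ImageNet: {pixel_ratio(UHD_4K):.1f}x")  # 165.3x
```

Under those assumptions HD comes out at about 41 times an ImageNet frame, which matches the figure above, and 4K is four times HD again.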


Natural Language Processing (NLP) models are ballooning even more, since they address more challenging tasks such as summarizing articles or translating English into Chinese. The largest models require 12 trillion weights! Model size is limited by training time. For example, training the well-known GPT-3 model takes 1,024 NVIDIA A100 GPUs over a month. Cluster size is topping out too, since 1,024 GPUs cost around $25M. So there has not been any real growth in the largest trained models in the last year. Training separate models in a "mix of experts" (MoE) is easier than training a single large model. Future growth will be paced by hardware progress such as the new NVIDIA Hopper H100 GPU.
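To put the cluster numbers in perspective, the following back-of-the-envelope sketch works out the implied cost per GPU and the GPU-hours for one training run. It treats "over a month" as a 30-day lower bound, which is an assumption, and the $25M figure presumably covers the whole system rather than just the GPUs.

```python
GPUS = 1024                    # cluster size quoted in the text
CLUSTER_COST_USD = 25_000_000  # approximate cluster cost quoted in the text
TRAINING_DAYS = 30             # assumption: "over a month" taken as a 30-day lower bound


def cost_per_gpu(cluster_cost, gpus):
    """Cluster cost divided evenly across GPUs (system overhead included)."""
    return cluster_cost / gpus


def gpu_hours(gpus, days):
    """Total GPU-hours consumed if every GPU runs for the whole period."""
    return gpus * days * 24


print(f"Cost per GPU: ${cost_per_gpu(CLUSTER_COST_USD, GPUS):,.0f}")       # $24,414
print(f"GPU-hours for one run: {gpu_hours(GPUS, TRAINING_DAYS):,}")        # 737,280
```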

Another trend is the move to smaller data types. In particular, FP8 (yes, 8-bit floating point) is emerging for both training and inference, but there are different flavors depending on how many exponent bits are used. I covered this in more detail in my post HOT CHIPS Day 2: AI...and More Hot Chiplets. NVIDIA's Hopper is the first commercial design that supports FP8. It seems that FP8 is most beneficial for large NLP models.
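To see why the number of exponent bits matters, here is a minimal sketch of an 8-bit float decoder with a configurable split between exponent and mantissa bits. It assumes strict IEEE-style conventions (biased exponent, subnormals, all-ones exponent reserved for infinity and NaN); shipping FP8 formats may treat the special encodings differently, so the exact ranges below are illustrative only. The function names are hypothetical.

```python
def decode_minifloat(byte, exp_bits, man_bits):
    """Decode an 8-bit IEEE-style float: 1 sign bit, exp_bits, man_bits."""
    assert 1 + exp_bits + man_bits == 8 and 0 <= byte <= 0xFF
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exponent == (1 << exp_bits) - 1:
        # Assumption: all-ones exponent encodes infinity or NaN, as in IEEE 754
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:
        # Subnormal: no implicit leading 1
        return sign * (mantissa / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** (exponent - bias)


def largest_finite(exp_bits, man_bits):
    """Largest finite value: highest non-reserved exponent, mantissa all ones."""
    byte = (((1 << exp_bits) - 2) << man_bits) | ((1 << man_bits) - 1)
    return decode_minifloat(byte, exp_bits, man_bits)


# 4 exponent bits + 3 mantissa bits: finer steps, smaller range
print(largest_finite(4, 3))  # 240.0
# 5 exponent bits + 2 mantissa bits: coarser steps, larger range
print(largest_finite(5, 2))  # 57344.0
```

The tradeoff is the usual one: every bit moved from mantissa to exponent buys dynamic range at the cost of precision.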

Another tradeoff that some processors are making is sparse computing. NVIDIA's Ampere, for example, can preprocess the model to remove half the weights by eliminating values that are close to zero. This greatly improves performance, but there is sometimes a reduction in accuracy.
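The pruning step can be sketched in a few lines. This version assumes the 2:4 structured pattern that NVIDIA describes for Ampere, where the two smallest-magnitude weights in every group of four are zeroed; the keynote only says that half the weights are removed, so the grouping here is my assumption, and a real flow would typically fine-tune the model afterward to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.05, 0.02, -0.7, 0.1, 0.3, -0.4, 0.01]
print(prune_2_of_4(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the zeros fall in a regular pattern, hardware can skip them without the bookkeeping that unstructured sparsity would need.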

Some novel approaches are spiking neural networks (SNN) which are more like real brains. Analog and photonic computation promise large power savings. There are many startups working with some of these approaches (some of whom presented at the conference) but none have achieved high-volume production.

Linley said that in previous years he has been criticized for being "too nice" to everyone. So this year he had a few slides sprinkled through the presentation titled "What does Linley really think?" The first one was:

AI chip performance has grown incredibly rapidly: >2x per year
- Hopper H100 offers 100x more Tflop/s than Pascal P100 (2016)
This growth rate is not sustainable
- Driven by addition of matrix unit (aka tensor cores)
- Driven by move from FP32 to FP8/INT8 (little room for further progress)
- Driven by growth in power (TDP) from 300W to 700W (again, little room to grow)
Future gains will be more like 30–40% per year
- Power efficiency could improve even less quickly
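The compound growth rates on this slide are easy to check. The sketch below assumes a six-year span between P100 (2016) and H100 (2022), and then asks how long another 100x would take at the slower rates Linley forecasts.

```python
import math


def annual_growth(total_gain, years):
    """Compound annual growth factor that yields `total_gain` over `years`."""
    return total_gain ** (1 / years)


def years_to_reach(total_gain, annual_rate):
    """Years needed to reach `total_gain` at a fixed annual growth rate."""
    return math.log(total_gain) / math.log(1 + annual_rate)


# Assumption: H100 counted as a 2022 part, so the span is six years
past = annual_growth(100, 2022 - 2016)
print(f"P100 to H100: {past:.2f}x per year")  # 2.15x per year

for rate in (0.30, 0.40):
    years = years_to_reach(100, rate)
    print(f"At {rate:.0%} per year, another 100x takes {years:.1f} years")
# At 30% per year, another 100x takes 17.6 years
# At 40% per year, another 100x takes 13.7 years
```

So the same gain that took six years would take something like 14 to 18 years at the forecast rates.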

There are new results from the MLPerf benchmarks. See my post MLPerf: Benchmarking Machine Learning for background. Some random notes:

Intel's Gaudi2 leads NVIDIA's A100 on ResNet but not on BERT
Google's TPUv4 was about even with A100 in previous round of benchmarks
No NVIDIA Hopper training results yet

NVIDIA's Hopper (aka H100) triples A100's TOPS, and initial MLPerf results show an average gain of 1.8X (using the same datatypes). FP8 support improves training by another factor of 2X, and NVLink communications improve training of very large models by another 2X. But power rises to 700W, and the efficiency gain on most models is about 40%. However, production is delayed, with the PCIe card version shipping now, SXM modules around the end of the year, and DGX systems in Q1.

Qualcomm's Cloud AI 100 can outperform NVIDIA's A100 especially in power efficiency (75W per card versus 400W) and uses less rack space (2U vs 6U). Hopper/H100 will deliver greater performance but doesn't close the gap in terms of power efficiency. But also note that Cloud AI 100 only handles inference, not training.

Datacenter processors are all adding AI engines:

Intel's Sapphire Rapids adds AMX units for 8X AI gain
IBM z16 (Telum) adds accelerator although peak performance is only 6 Tflops
Marvell Octeon 10 includes a dedicated AI accelerator that reaches 20 TOPS (INT8) in just 2W
Intel's approach is to add an AI accelerator to each core; everyone else adds a single AI accelerator to the multicore chip
All of these accelerators come "free" with the server processor, and so are good for occasional AI work, although they are still nowhere near the efficiency of standalone AI processors

So what does Linley really think?

Large data-center customers need flexible and easy-to-use solutions
- They train and deploy a broad range of models in rapidly changing markets
Vendors must deliver a flexible architecture and broad software stack
- Too early in AI development cycle to add hardware support for specific algorithms
- Platform must work well for different models, even ones yet to be invented
Large chip vendors with extensive software expertise are best suited
- E.g., NVIDIA, Intel, Qualcomm, AMD (if it gets serious about AI)
Startups are currently serving niche markets or enterprises
- Difficult to justify multi-billion-dollar valuations without a breakthrough in share
Largest cloud-service providers can build their own AI chips
- But still struggle to build a broad software stack and flexible architecture
- Only Google has deployed in-house AI chips in high volume

Next, Linley moved on to the "edge", which he defined as everything outside the datacenter.

First, all PCs will have AI acceleration (Apple, AMD, Intel). Universal deployment will encourage more software development.

In automotive, NVIDIA's Orin offers 137 TOPS at 60W, with Thor following in 2024 at 1,000 TOPS (INT8) at maybe 300W. Intel's Mobileye (well, technically not Intel's anymore) has level 4 driving solutions, with EyeQ5 needing several chips at 40W and EyeQ6 Ultra needing a single 100W chip but not out until 2025. Qualcomm Cloud AI 100 delivers up to 400 TOPS at 75W, or 1000 TOPS at 15W.
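One way to compare these parts is peak TOPS per watt. The sketch below uses only figures quoted in this post (including the Marvell Octeon 10 accelerator from the datacenter list) and includes just the 75W Cloud AI 100 configuration. These are vendor peak numbers, not necessarily at the same data type, and Thor's power is a guess, so treat the ranking as indicative at best.

```python
# (peak TOPS, watts) as quoted in the post; not measured results
CHIPS = {
    "NVIDIA Orin": (137, 60),
    "NVIDIA Thor (2024, power is a guess)": (1000, 300),
    "Qualcomm Cloud AI 100 (75W card)": (400, 75),
    "Marvell Octeon 10 AI accelerator": (20, 2),
}


def tops_per_watt(tops, watts):
    """Peak throughput per watt; ignores utilization and data-type differences."""
    return tops / watts


ranked = sorted(CHIPS.items(), key=lambda item: tops_per_watt(*item[1]), reverse=True)
for name, (tops, watts) in ranked:
    print(f"{name}: {tops_per_watt(tops, watts):.1f} TOPS/W")
# Marvell Octeon 10 AI accelerator: 10.0 TOPS/W
# Qualcomm Cloud AI 100 (75W card): 5.3 TOPS/W
# NVIDIA Thor (2024, power is a guess): 3.3 TOPS/W
# NVIDIA Orin: 2.3 TOPS/W
```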

There are lots of startups offering scalable IP for edge SoCs. Literally too many to mention. Established IP vendors, such as Cadence, offer AI tiers ranging from basic (4-256 MAC units) to max (1K-16K MAC units, 32 TOPS). But pretty much all the established IP vendors offer some sort of AI acceleration.

So what does Linley really think?

Automotive will be a huge market for AI chips
- Almost all vehicles will be Level 2 by 2025; Level 4 will be commonplace by 2030
- Level 4 will demand high-performance hardware and complex AI models
- We forecast ~$10 billion in automotive AI chip revenue in 2030—60% for Level 4
- Difficult market for new vendors
Most other edge applications don’t need an AI accelerator
- Internet-connected devices (speakers, cameras) can offload most AI to the cloud
- Line-powered devices can run AI on CPU with vector (SIMD) extensions
- Even microcontrollers (MCU) can handle basic AI tasks
- Standard programming model instead of a dedicated (proprietary) AI accelerator
- AI hardware is most valuable in battery-powered devices
- Laptops, smartphones, smartwatches use processors with integrated (typically in-house) AI engine
- Little volume left for licensed AI accelerators

That last bullet is the most telling. Given that the automotive market players have settled out, and all laptops and smartphones have (or will have) AI accelerators, there just is not a high-volume opportunity for licensing AI accelerator IP or selling AI accelerator chips. The conference reminded me of the old Linley Mobile Processor Conference, where lots of people were designing processors for the mobile handset industry. But it turned out that there wasn't really a market at all. The leaders in mobile built their own processors, and Qualcomm and MediaTek supplied pretty much everything else. It seems clear that most of the AI processor companies will not find large, profitable markets.
