Paul McLellan
27 Apr 2022

Linley: AI Moving to the Edge

Last week was the Linley Spring Processor Conference. One change since last year is that the Linley Group, which runs the conferences, was acquired in October by Canada-based TechInsights. As the press release announcing the acquisition put it:

"We have collaborated with The Linley Group many times over the years," said Gavin Carter, CEO of TechInsights. "Our offerings are well aligned, and The Linley Group’s microprocessor architecture commentary is complementary to the in-depth technical analysis of microprocessors from TechInsights. The addition of The Linley Group reports to the TechInsights platform will result in a richer set of content for our clients, particularly for those with interest in microprocessors."

Linley Gwennap gave the keynote on the first day of the conference, and Jason Abt, the CTO of TechInsights, gave the keynote on the second day, which I'll cover in a post next week.

Trends in AI Acceleration

Linley updated his graphs on how fast models have been growing in 2021. The situations for imaging models and natural language processing (NLP) models are quite different: in his graph, imaging models are plotted in black and NLP models in red.

Image models continue to increase in size at about 2X per year but have reached the point of diminishing returns, with tiny gains in accuracy for huge increases in model size. Image models are typically quoted using ImageNet images, which are 224x224 pixels. That is tiny compared to HD or 4K images: an HD (1080p) frame requires about 40X the processing of an ImageNet image, and 4K requires about 160X.
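As a sanity check on those ratios, here is the pixel-count arithmetic (my own back-of-envelope, assuming compute scales roughly linearly with pixel count):

```python
# Pixel counts relative to a 224x224 ImageNet crop, as a rough proxy for compute.
imagenet = 224 * 224          # 50,176 pixels
hd = 1920 * 1080              # 1080p frame
uhd_4k = 3840 * 2160          # 4K frame
print(round(hd / imagenet))      # ~41X more pixels than an ImageNet image
print(round(uhd_4k / imagenet))  # ~165X more pixels than an ImageNet image
```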

Large NLP models are handling more complex tasks such as English-to-Chinese translation or creating article summaries. These models are huge and still seem to be growing at 40X per year; the largest now have as many as 10 trillion weights. Model sizes are limited by the training resources required (time and GPUs). For example, training GPT-3 takes 1,024 NVIDIA A100 GPUs for a month, roughly $25 million of resources. Work is going on to discover how to increase model accuracy without requiring more cycles. One rule is that "bigger is not always better": for example, DeepMind's Retro model beats GPT-3 using only 7 billion parameters (as opposed to 175 billion).
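For a sense of the scale implied by those figures, here is a back-of-envelope calculation (my arithmetic from the numbers above, not from the talk):

```python
# 1,024 A100 GPUs running for roughly a month.
gpus = 1024
hours = 30 * 24                    # about a month of wall-clock time
gpu_hours = gpus * hours
print(gpu_hours)                   # ~737,000 GPU-hours
print(25_000_000 / gpu_hours)      # the $25M figure implies roughly $34 all-in per GPU-hour
```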

Another trend is toward smaller and smaller datatypes. One of the surprising things about neural networks is how small the datatypes can get without loss of accuracy: not just 8-bit but 4-bit and even 2-bit. NVIDIA's recently announced Hopper is the first FP8 design. There are two different FP8 formats: E5M2, with a 5-bit exponent and a 2-bit mantissa (plus the hidden bit, since the mantissa always starts with 1), and E4M3, with a 4-bit exponent and a 3-bit mantissa. These very low precision FP8 formats seem to work best with very large models; other models don't tolerate them well.
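To make the two bit layouts concrete, here is a minimal decoding sketch. The sign/exponent/mantissa splits are the ones named above; the helper function is my own illustration and ignores the inf/NaN special encodings that the real formats define:

```python
def decode_fp8(byte, exp_bits, man_bits):
    """Decode an 8-bit pattern as sign / exponent / mantissa with the given split."""
    sign = -1.0 if (byte >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:                     # subnormal: no hidden leading 1
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# E5M2 trades precision for range; E4M3 trades range for precision.
print(decode_fp8(0b0_10000_01, exp_bits=5, man_bits=2))  # 2.5  in E5M2
print(decode_fp8(0b0_1000_001, exp_bits=4, man_bits=3))  # 2.25 in E4M3
```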

Another trend is the increased use of sparse computing, which avoids doing MAC operations when an input is zero. NVIDIA's Ampere can even rearrange the computation to throw out 2 out of every 4 coefficients, which can double throughput with no loss of accuracy if the discarded weights are close to zero.
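The 2-out-of-4 scheme is easy to picture with a small sketch (my illustration of the general idea, not NVIDIA's actual implementation): in every group of four weights, keep the two with the largest magnitude and zero the rest, so half the MACs can be skipped.

```python
import numpy as np

def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude weights in every group of 4; zero the rest."""
    w = np.array(weights, dtype=float).reshape(-1, 4)
    for group in w:
        drop = np.argsort(np.abs(group))[:2]   # indices of the 2 smallest magnitudes
        group[drop] = 0.0
    return w.reshape(-1)

w = [0.9, -0.02, 0.4, 0.01, -0.7, 0.03, 0.0, 1.2]
print(prune_2_of_4(w))   # half the weights are now zero and their MACs can be skipped
```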

There are three other approaches. The first is spiking neural networks (SNNs), which operate more like the nerve cells in the brain and use just simple counters and an adder, with no MAC, so power consumption is a lot lower. BrainChip, GrAI Matter, Innatera, and Intel are among the companies trying this approach.
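A minimal sketch of that idea, assuming a simple integrate-and-fire neuron (my illustration, not any particular vendor's design): inputs are spikes (1) or silence (0), so the neuron only ever adds weights and never multiplies.

```python
def integrate_and_fire(spike_trains, weights, threshold=4.0):
    """spike_trains: one list of 0/1 inputs per timestep; fires when the potential crosses threshold."""
    potential = 0.0
    output = []
    for spikes in spike_trains:
        for s, w in zip(spikes, weights):
            if s:                    # a spike arrived on this input...
                potential += w       # ...so add its weight (addition only, no MAC)
        if potential >= threshold:   # fire and reset
            output.append(1)
            potential = 0.0
        else:
            output.append(0)
    return output

print(integrate_and_fire([[1, 1, 0], [1, 0, 1], [1, 1, 1]], [1.5, 2.0, 1.0]))  # [0, 1, 1]
```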

The second is analog computation, which reduces power by a lot, as much as 99%. The most popular approach is in-memory compute, although analog variation can reduce accuracy. Companies using this approach include Ambient, IBM, Mythic, and TetraMem, although none of them has yet released a production product.

Photonics is the third approach; it can reduce power by as much as 10X. Large optical systolic arrays can flow the data through efficiently. Companies working on this include Lightelligence, Luminous, and others.

What's New in the Datacenter

NVIDIA's recently announced Hopper H100 triples the prior A100's TOPS numbers and leads to a 2.5X gain on most AI models. There is also a new system-level NVLink. There can be a performance gain of 6-9X for models with >100B parameters...but it comes at a cost in power, since the H100 dissipates as much as 700W.

But, as Jensen Huang, the CEO, has said:

Datacenter customers don't care about power, they care about power efficiency.

Although Hopper is the highest power, lots of others are close.

When it comes to efficiency, Qualcomm tops NVIDIA. A Qualcomm Cloud AI 100 card can outperform an NVIDIA A100 while using only 75W per card versus 400W, and less rack space. Hopper H100 will deliver greater performance but does nothing to close the gap in power efficiency. However, the AI 100 handles only inference, not training.

Graphcore Bow uses wafer-on-wafer technology to stack two die. With lots of deep trench capacitors, the power distribution network has more margin, and the Graphcore Colossus die gains 40% in clock speed while reducing voltage by 10%.

Going to the extreme, Linley covered two chips that I have previously written about, Cerebras's wafer-scale chip and Tesla's Dojo chips. There are no disclosed benchmarks for either of these, but you can read in-depth coverage in my posts:

  • HOT CHIPS: The Biggest Chip in the World
  • Tesla's Project Dojo

In addition, datacenter CPUs are adding AI engines with reasonable TOPS numbers, if not as high as the standalone solutions:

  • Intel Sapphire Rapids adds AMX units for an 8X gain
  • IBM z16 (Telum) adds an accelerator, but performance is only 6 TOPS on FP16
  • Marvell Octeon 10 includes a dedicated AI accelerator that can reach 20 TOPS (INT8) in just 2W of power

Since these AI engines come for "free" with the processor, they are good for occasional AI work, small models, and mixed workloads.

What's New at the Edge

There are a number of processors targeting the high end, meaning autonomous vehicles and multi-camera surveillance systems (a quick TOPS/W comparison follows the list):

  • NVIDIA's upcoming Orin offers 137 TOPS at 60W, or 275 TOPS in sparsity mode (shipping 2H22)
  • AMD's Versal AI Edge FPGAs (formerly Xilinx) offer up to 202 TOPS at 75W (1H23)
  • Qualcomm Cloud AI 100 offers 400 TOPS at 75W (available now) and can also be configured for 100 TOPS at 15W
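For a rough comparison of these high-end parts, here is the TOPS/W arithmetic using the peak numbers quoted above (dense mode for Orin; real efficiency depends heavily on the model and batch size):

```python
parts = {
    "NVIDIA Orin": (137, 60),
    "AMD Versal AI Edge": (202, 75),
    "Qualcomm Cloud AI 100": (400, 75),
}
for name, (tops, watts) in parts.items():
    print(f"{name}: {tops / watts:.1f} TOPS/W")   # ~2.3, ~2.7, ~5.3 respectively
```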

Single-camera chips for less demanding applications are popular:

  • Hailo-8 offers 26 TOPS in 6W
  • Ambarella announced CV5S with 12 TOPS at 5W
  • Quadric offers 4 TOPS at 2W
  • RealTek offers 3916N with 0.7 TOPS in 0.8W
  • ...and many others

Moving to IP (as opposed to actual chips), there are many startups:

  • EdgeCortix IP scales from 1K to 33K MAC units at 54 TOPS
  • Edged.AI has IP that scales to 32 TOPS
  • Expedera IP scales from 1 to 100 TOPS with leading power efficiency
  • Vsora IP scales to 66 FP8 TOPS per core, or 1,000 TOPS with 16 cores

And later that day Cadence announced the new NNE110 accelerator. See my post The Latest Addition to the Tensilica Family Is a Baby Neural Network Engine.

Even microcontrollers are getting on the AI bandwagon. Of course, simple networks can just run in software on the microcontroller, but some embedded microcontrollers include either vector extensions or AI acceleration:

  • Arm Cortex-M55 includes Helium vector extensions
  • GreenWaves GAP9 combines a RISC-V MCU with a 120 GOPS engine
  • NXP i.MX 93 integrates an Arm Cortex-A55 with a 512 GOPS Ethos DLA

Summary

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.


Tags: hopper, deep learning, Linley, neural network, AI