Last week was the Linley Group's Fall Processor Conference. The conference opened, as usual, with Linley Gwennap's overview of the processor market (both silicon and IP). His opening keynote was titled Application-Specific Processors Extend Moore's Law.
As Linley put it:
Moore's Law Is Winding Down.
The state of play is that TSMC's 5nm is in production, with the first product being Apple's A14 application processor in the recently announced iPhone 12. TSMC expects 3nm in production in 4Q22. At its technology symposium, it actually said risk production in 2021. You can read more about the process roadmap in my post TSMC Technology Symposium: All the Processes, All the Fabs.
Linley went on to say that:
I'm not sure I agree with those statements. Resistance is a problem, of course, but interconnect can get thicker as well as narrower. EUV steppers are expensive, but the alternative is a lot of expensive multiple patterning and lower fab throughput. My assumption has always been that EUV is cheaper, at least after getting up the learning curve. And transistor counts keep expanding without being "stymied by cost and power limits". For example, AMD just announced its new gaming processor for $299 and has the performance lead. I don't have any inside knowledge of NVIDIA's GPU performance, but it seems unlikely its throughput is lower; perhaps it simply makes more sense to add more cores rather than faster ones.
Anyway, I'm certainly not going to argue that Moore's Law is not slowing down. Even if densities keep going up, it is clear that the price per transistor is not falling in the way that it used to. As I've said before (such as in my post Domain-Specific Computing 3: Specialized Processors), general-purpose processors have run into the wall for performance, but mostly that is due to hitting the power limits. The way to get around that is with domain-specific processors instead of just adding more identical cores. In fact, Cadence has an entire product line of Tensilica processors that are largely targeted at being an easy way to create domain-specific computing.
Or as Linley put it, "custom accelerator chips optimize compute, memory resources". Examples include GPUs, DSPs, network processors (NPUs), AI accelerators, data processors (DPUs), and storage.
Network processing is moving from the main CPU to specialized network processors, which offer more flexibility for advanced functions like virtual routers and Cloud RAN. They also offload most of the packet-processing load from the main processors, freeing them up for general-purpose work.
AI accelerators have their own set of challenges, since moving the data around can consume as much power as doing the calculations. Caches don't help much, since the order of access is known in advance, so there are better ways to optimize: in particular, SIMD (single instruction, multiple data), which shares instruction decoding across many datapaths, or systolic arrays, where data flows from one MAC unit to its neighbor on each cycle.
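The systolic idea can be sketched in a few lines of Python. This is a toy cycle-by-cycle simulation of a 1-D output-stationary array, not any particular vendor's design; the function name and structure are my own illustration.

```python
def systolic_matvec(W, x):
    """Toy simulation of a 1-D output-stationary systolic array.

    Cell i holds row i of W and accumulates output y[i]. The input
    vector streams through the chain of cells, moving one cell per
    cycle, so each value is fetched from memory exactly once -- no
    cache is needed because the access order is known in advance.
    """
    n_cells = len(W)
    acc = [0.0] * n_cells       # output-stationary accumulators
    pipe = [None] * n_cells     # value currently held in each cell
    seen = [0] * n_cells        # how many inputs cell i has consumed
    # Pad the stream so the last value can drain through every cell.
    for value in list(x) + [None] * (n_cells - 1):
        pipe = [value] + pipe[:-1]   # shift: each cell feeds its neighbor
        for i, v in enumerate(pipe):
            if v is not None:
                acc[i] += W[i][seen[i]] * v
                seen[i] += 1
    return acc

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [7.0, 8.0, 9.0]
y = systolic_matvec(W, x)   # same result as the matrix-vector product W @ x
```

The point of the shifting `pipe` list is that data only ever moves between neighboring cells, which is what makes the wiring (and the power) of real systolic arrays so cheap.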
Another way to use memory better is known as in-memory compute. There are two main approaches. One is to move multiple small compute units closer to the memory array; the other actually uses the memory array itself to compute with analog functions, perhaps using analog versions of magneto-resistive RAM (MRAM), where the variable resistance can hold analog weights for neural networks. There is some loss of accuracy, which has to be made up for with a bigger network, but the power is greatly reduced.
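To see why analog in-memory compute trades accuracy for power, here is a minimal sketch. The 5% conductance noise is an illustrative assumption of my own, standing in for device variation; real cells, arrays, and error figures differ.

```python
import random

random.seed(42)

def analog_in_memory_matvec(W, x, noise=0.05):
    """Sketch of analog in-memory compute: each weight is stored as a
    slightly inaccurate conductance in the memory array, and the
    multiply-accumulate falls out of physics -- applying input voltages
    and summing the resulting currents on each bitline. The 5% noise
    models device variation (an illustrative assumption)."""
    result = []
    for row in W:
        total = 0.0
        for w, v in zip(row, x):
            g = w * (1.0 + random.gauss(0.0, noise))  # imperfect stored weight
            total += g * v   # bitline current summation does the MAC for free
        result.append(total)
    return result

W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [1.0, 2.0, 3.0]
exact = [sum(w * v for w, v in zip(row, x)) for row in W]
approx = analog_in_memory_matvec(W, x)
errors = [abs(e - a) for e, a in zip(exact, approx)]  # small but nonzero
```

The result is close to the exact matrix-vector product but never identical, which is exactly the accuracy loss that a somewhat larger network has to absorb.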
Models for AI (parameters, or weights, for neural networks) are getting larger...much larger. Image-processing networks are growing at about 2X per year, but natural-language networks were growing at about 10X per year and now seem to be up to 20X per year. The recently announced GPT-3, which you have probably heard about, has 175 billion parameters (whereas GPT-2 had only 1.5 billion).
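To put those growth rates in perspective, a little back-of-envelope arithmetic (my own, not from the keynote):

```python
import math

# GPT-2 (1.5 billion parameters) to GPT-3 (175 billion) is roughly
# a 117x jump in parameter count:
growth_factor = 175e9 / 1.5e9   # about 116.7

# At the ~20x-per-year pace cited for natural-language networks,
# model size doubles roughly every 2.8 months:
doubling_months = 12 * math.log(2) / math.log(20)
```

No silicon roadmap doubles anything every three months, which is why these models increasingly run on clusters of accelerators rather than on a single chip.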
There are lots of architecture choices for AI acceleration, with little cores or big cores. Little cores are easier to design and replicate on large chips, and easier to use to hit various performance/power points. For example, there is the Cerebras wafer-scale chip that I wrote about in HOT CHIPS: The Biggest Chip in the World, with over 400K cores (850K on the version 2 chip/wafer they announced recently). At the other end of the scale are big cores like Groq, Google's TPU v3, Alibaba, and Habana (Intel), all of which have fewer than eight cores (one in the case of Groq). Big cores make on-chip interconnect a lot simpler and simplify the compiler design. Multicore designs (small cores) tend to have more latency, since data has to move through the chip, so they tend to require larger batch sizes to reach full efficiency (but then training loads, as opposed to inference, typically do have large batch sizes).
Almost all AI accelerator cores support 8-bit integer (INT8), 16-bit floating point (FP16), and bfloat16 (which has the same exponent range as FP32 but a smaller mantissa, since dynamic range is more important than high accuracy). Many of them also exploit sparsity, optimizing for zero-valued weights and input data. This doesn't improve performance but does improve power (by skipping both the data movement and the multiplication). NVIDIA's Ampere goes further: it prunes the weights closest to zero, eliminating 50% of them, and by skipping those computations entirely it doubles throughput.
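The bfloat16 trade-off is easy to demonstrate: a bfloat16 value is just an FP32 value with the low 16 bits of its encoding dropped, keeping the full 8-bit exponent but only 7 mantissa bits. This sketch truncates rather than rounds, which keeps it simple (real hardware typically rounds).

```python
import struct

def fp32_to_bfloat16(x):
    """Convert a float to bfloat16 precision by truncating the low
    16 bits of its FP32 bit pattern. The 8-bit exponent (and hence
    FP32's dynamic range) is preserved; only mantissa precision is lost.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", (bits >> 16) << 16))[0]

pi_bf16 = fp32_to_bfloat16(3.14159265)   # coarser: 3.140625
big = fp32_to_bfloat16(1e30)             # huge values survive intact-ish
```

Notice that even 1e30 converts without overflowing, which is exactly why training hardware prefers bfloat16 over FP16: gradients with wild dynamic range don't blow up, and the lost mantissa bits matter less.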
Some networks go further still, to binary neural networks (BNNs) with 1-bit weights. Multiplication then becomes just XNOR, which greatly reduces power. This approach achieves about 70% accuracy versus 80% for full precision, but requires far fewer resources. Scaling the network up allows this approach to get within a few percent of full precision, with over a 90% reduction in the memory required for weight storage. Examples are Lattice, LeapMind, and XNOR.ai (now Apple).
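The XNOR trick is worth seeing concretely. If a vector of ±1 values is packed into a bitmask (bit = 1 encodes +1, bit = 0 encodes −1), then every elementwise product is an XNOR of two bits, and the dot product reduces to a population count. The function below is my own illustration of the technique:

```python
def bnn_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed as bitmasks
    (bit = 1 encodes +1, bit = 0 encodes -1). Multiplying two +-1 values
    is just XNOR of their bits, and summing is a popcount -- no
    multipliers or MAC units required."""
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n   # +1 per matching bit, -1 per mismatch

# [+1, +1, -1, +1] . [+1, -1, -1, +1] = 1 - 1 + 1 + 1 = 2
# packed (bit i = element i): a = 0b1011, b = 0b1001
dot = bnn_dot(0b1011, 0b1001, 4)
```

One XNOR plus one popcount replaces n multiplies and n−1 adds, which is where the dramatic power savings come from.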
Spiking neural networks are another approach, more analogous to how neurons work in the brain. No MACs are needed, just counters and adders to sum the spikes, with correspondingly much lower power (see BrainChip, GrAI Matter, Intel).
The typical server for AI training in the data center is an Intel processor with an NVIDIA GPU. As Linley titled his slide "NVIDIA extends market lead".
NVIDIA's Volta V100 (2017) is the best-selling processor for AI training, with Google's TPU the only significant competition (and you can't buy one). NVIDIA's Turing T4 GPU (2018) is gaining ground for AI inference. It surpassed Volta in unit sales at the end of last year (but not in revenue, since it is a much smaller chip). NVIDIA's data center revenue is up 80% in 1H20, not even including Mellanox, which it acquired in a deal that closed in April. Next-generation NVIDIA Ampere products are now shipping with 624 TOPS (2.4X Turing), or 1,250 TOPS with sparsity. But power is up from 300W to 400W.
Training challengers "fall short of Ampere". Cerebras has the huge chip with hundreds of thousands of cores but hasn't published any benchmark data. Graphcore offers its GC2 chip with 1,216 cores and performance similar to the V100. Intel/Habana has a Gaudi accelerator, but it is delayed. Huawei introduced Ascend Max with training benchmarks similar to the V100.
Still in the data center, inference vendors have improved efficiency. Qualcomm has the top ResNet-50 score among merchant chips and needs only 75W of power (though only limited benchmark data is available). NVIDIA's A100 offers similar performance but burns 400W. Groq has a single-core design with similar performance, great for a batch size of 1, but is 300W. There are others in the pipeline.
In the cloud, everyone seems to have their own solution:
Software is the big challenge for most of these vendors. NVIDIA has worked for over 10 years on its CUDA framework. Accelerator vendors have to port AI frameworks to their chips and often don't implement much of TensorFlow's functionality. This means that customer applications fail to even compile at first, and end up requiring the chip vendor's support. As Linley put it succinctly:
Resnet-50 is easy, real workloads are hard
There are two sorts of accelerators at the edge: at the network edge (say, in a basestation) or on edge devices (say, on your phone). Cloud vendors are moving some services closer to the end user, creating "regional datacenters" to reduce latency for video and voice services.
Of course, smart vehicles are edge devices:
The current deployment situation is:
The other type of acceleration at the edge is in client devices such as smartphones and new voice assistants like the latest Alexa. The big advantages are lower latency, operation even when the network is down, and privacy, since your conversations are not uploaded to the cloud where they might expose private information.
The barriers to entry to this market are lower since the performance is lower, and as a result there are a plethora of startups piling in (BrainChip, Cambricon, Cornami, Eta Compute, Flex Logix, Grai Matter, GreenWaves, Gyrfalcon, Horizon Robotics, Kneron, Mythic, Perceive, SiMa.ai, Syntiant, and more) along with established MCU vendors like NXP and Maxim. Not to mention Intel, Lattice, NVIDIA, and others who offer embedded AI chips.
The TinyML foundation targets small sensors. Google offers TensorFlow Lite for MCUs, and Facebook offers its Glow compiler (see my post NXP Glows in Tensilica HiFi).
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.