Paul McLellan


Neural Nets Hit the Roofline—Memory for AI

26 Nov 2018 • 4 minute read


At the recent Linley Processor Conference, there were two interesting presentations on the memory demands of AI processors:

  • Steven Woo of Rambus on Memory Systems for AI and Leading-Edge Applications.
  • Ryan Baxter of Micron on AI Shaping Next-Generation Memory Solutions.

Some of their slides were so similar that Ryan had to say "I promise we didn't confer beforehand." By the way, Ryan last appeared in Breakfast Bytes just a couple of months ago when he presented, along with Cadence's Marc Greenberg, at TSMC's OIP. See DDR5 Is on Our Doorstep.

Before going any further, let me give you Ryan's table that summarizes a lot of memory interfaces in one place. Note that the data rate is per pin (and is in gigabits per second), and the bandwidth is all the pins together (and is in gigabytes per second).
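
As a quick aid to reading a table like that, per-device bandwidth is just the per-pin data rate multiplied by the number of data pins, divided by eight to convert gigabits to gigabytes. Here is a minimal Python sketch of that arithmetic, using illustrative interface widths and data rates rather than the values from Ryan's table:

```python
# Back-of-envelope: per-pin data rate (Gb/s) x number of data pins / 8 = GB/s.
# Interface widths and data rates below are illustrative approximations,
# not the values from Ryan's table.
interfaces = {
    # name:       (Gb/s per pin, data pins per device or stack)
    "DDR4-3200":  (3.2,   64),
    "GDDR6":      (16.0,  32),
    "HBM2":       (2.0, 1024),
}

for name, (gbps_per_pin, pins) in interfaces.items():
    bandwidth_gbs = gbps_per_pin * pins / 8   # gigabytes per second
    print(f"{name:>9}: {gbps_per_pin:5.1f} Gb/s/pin x {pins:4d} pins = {bandwidth_gbs:6.1f} GB/s")
```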

Steven Woo

Steven started off talking about how faster computers and large datasets have enabled modern AI. But more performance is needed, and that is a challenge now that Moore's Law is ending (an economic challenge), Dennard scaling has ended (a power challenge), and general-purpose microprocessor architectural improvement has stalled (so you can't just write software and wait for it to get faster). This has led to specialized AI processors of one sort or another.

Generally speaking, these processors do a lot of parallel processing, performing some number of operations for each byte of data. The graph above shows the roofline model. On the X-axis is the number of operations performed per byte, the operational intensity. On the Y-axis is the performance, in operations per second. The exact shape of the graph depends on the architecture. Applications with low operational intensity are memory bound (there is plenty of compute power to spare). Applications with high operational intensity are compute bound (there are not enough operational units to keep the memory interface saturated). The point of transition, which is neither compute nor memory bound (or both, depending on your perspective), is called the "ridge point"; it is where the computational elements first max out. Beyond it, increased operational intensity doesn't result in more performance, since the compute elements (typically MACs) are already working non-stop.
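
To make the shape of the model concrete, here is a minimal Python sketch (not tied to any particular chip): attainable performance is the minimum of peak compute and of memory bandwidth times operational intensity, and the ridge point is where those two limits cross. The peak-compute and bandwidth figures are made up for illustration.

```python
def attainable_ops(operational_intensity, peak_ops, bandwidth):
    """Roofline model: performance is capped either by memory bandwidth
    (the sloped part of the curve) or by peak compute (the flat roof)."""
    return min(peak_ops, bandwidth * operational_intensity)

# Illustrative (made-up) accelerator: 10 TOPS peak, 100 GB/s memory bandwidth.
PEAK = 10e12        # operations per second
BW = 100e9          # bytes per second

ridge_point = PEAK / BW   # ops/byte where the two limits meet
print(f"Ridge point: {ridge_point:.0f} ops/byte")

for oi in (10, 50, 100, 500, 1000):           # operational intensity, ops/byte
    perf = attainable_ops(oi, PEAK, BW)
    bound = "memory bound" if oi < ridge_point else "compute bound"
    print(f"OI = {oi:4d} ops/byte -> {perf / 1e12:5.2f} TOPS ({bound})")
```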

This graph shows some actual rooflines. The orange line is Intel's Haswell (a general-purpose x86 processor), the red line is NVIDIA's K80 (a GPU), and the blue line is Google's TPU v1 (for more details on that, see my post about the keynote that Google's Cliff Young delivered earlier the same day, Inside Google's TPU). What this shows is that the TPU is largely limited by memory bandwidth: increasing compute performance further without increasing memory bandwidth would have no effect. The TPU is probably representative of most specialized deep-learning processors: they have so much compute power that they max out the available memory bandwidth.
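
A rough way to see why a TPU-class accelerator sits on the sloped, bandwidth-limited part of its roofline is to compare ridge points. The peak-compute and bandwidth figures below are ballpark numbers of roughly the right magnitude, used only for illustration; they are not taken from the slides.

```python
# Ballpark peak compute (ops/s) and memory bandwidth (bytes/s); illustrative only.
machines = {
    "CPU-class (Haswell-like)": (1.3e12,  51e9),
    "GPU-class (K80-like)":     (2.8e12, 160e9),
    "TPU-v1-class":             (92e12,   34e9),
}

for name, (peak, bw) in machines.items():
    ridge = peak / bw   # ops per byte needed before compute becomes the limit
    print(f"{name:26s}: ridge point ~{ridge:6.0f} ops/byte")

# A TPU-class part needs thousands of operations per byte fetched before its
# MACs become the bottleneck; most neural-net workloads fall well short of that,
# so the chip ends up waiting on memory bandwidth.
```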

There are things that can be done to help optimize both. Reduced-precision arithmetic requires less hardware and energy in the compute units, and it also makes better use of memory bandwidth. Mixed precision is increasingly used (in the TPU, for example), with FP16 for the multiplies and FP32 for the accumulate, along with new datatypes geared towards AI (such as BFLOAT16, created for the TPU).
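
Here is a minimal NumPy sketch of the mixed-precision idea (illustrative only, not TPU code): keeping the operands in 16 bits halves the memory traffic per value, while accumulating the products in FP32 keeps the long running sum from losing precision.

```python
import numpy as np

n = 100_000
a = np.random.rand(n).astype(np.float16)   # 2 bytes per value instead of 4
b = np.random.rand(n).astype(np.float16)

# Mixed precision: the multiplies happen in FP16, the accumulation in FP32.
products_fp16 = a * b                                  # FP16 multiplies
dot_mixed = products_fp16.astype(np.float32).sum()     # FP32 accumulate

# Accumulating entirely in FP16 loses precision: once the running sum is much
# larger than each product, the small additions start to round away.
dot_fp16_acc = np.float16(0)
for p in products_fp16:
    dot_fp16_acc += p

print(f"FP32 accumulate: {dot_mixed:10.1f}")
print(f"FP16 accumulate: {float(dot_fp16_acc):10.1f}")
print(f"FP16 operands take {a.nbytes + b.nbytes} bytes; FP32 would take {2 * 4 * n}")
```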

Steven went into detail on three attractive memory architectures for AI and the tradeoffs between them (a back-of-envelope sketch of how bandwidth requirements drive the choice follows the list). His summary was:

  • On-chip memory: Highest bandwidth and power efficiency, lowest latency...but limited storage capacity.
  • HBM: Extremely high bandwidth and power efficiency, but higher cost, more challenging integration, and greater design complexity.
  • GDDR: A good tradeoff between bandwidth, capacity, power efficiency, cost, reliability, and design complexity...though signal integrity is more challenging.
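
One way to read that summary is to work backwards from the roofline: a given amount of compute at a given operational intensity implies a minimum memory bandwidth, and that requirement goes a long way towards dictating which memory architecture is viable. A back-of-envelope sketch, using a made-up accelerator and approximate per-device bandwidths:

```python
import math

# All figures are illustrative approximations, not taken from the talks.
peak_ops = 50e12               # hypothetical accelerator: 50 TOPS peak
operational_intensity = 200    # ops per byte for the target workload

# Bandwidth needed to keep the compute units busy (the roofline ridge condition):
required_bw = peak_ops / operational_intensity    # bytes per second
print(f"Required bandwidth: {required_bw / 1e9:.0f} GB/s")

# Ballpark bandwidth per memory device or stack:
per_part_bw = {
    "DDR4 DIMM":    25.6e9,
    "GDDR6 device": 64e9,
    "HBM2 stack":   256e9,
}
for part, bw in per_part_bw.items():
    print(f"{part:12s}: roughly {math.ceil(required_bw / bw)} needed to reach it")
```

On-chip SRAM can deliver far more bandwidth than any of these, but typically only for tens of megabytes, which is the capacity limitation in Steven's first bullet.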

Ryan Baxter

Ryan started out with some numbers showing the sheer amount of data being created. I'm sure you've seen plenty of graphs of this. One of the drivers in datacenters is that AI training workloads require 6X as much memory (DRAM) and 2X as much storage (SSD) as general workloads. That wouldn't be a challenge if AI were a minority interest, but AI-capable servers are going from a few percent of servers in 2017 to almost half of servers in 2025. It is a big opportunity for the cloud providers, since standard compute is $0.22 per hour whereas AI-capable accelerated compute is $11.20 per hour, about 50 times as much.

All this together has unleashed demand for more memory and storage, as you can see in the above chart.

Ryan had his own roofline diagrams, too, showing how many-core, accelerator-based solutions are taking over from CPUs as the most efficient compute engines, but that they are limited by memory bandwidth and require unique memory footprints.

Above is his table showing the tradeoffs between the different memory architectures: DDR4/5, GDDR5/6, and HBM.

Summary

Both Micron and Rambus supply products in all these different segments (IP in Rambus's case, actual memories in Micron's case), so their tradeoff charts are not especially trying to push you into "their" solution, since they offer all of them.

The two takeaways from both presentations are how to read a roofline diagram, and the fact that AI systems are increasingly limited by memory bandwidth rather than compute power. In particular, if you are involved in AI systems, don't overdesign the compute at the expense of memory bandwidth; you will be disappointed.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.