Linley: Training in the Datacenter, Inference at the Edge

24 Apr 2018 • 6 minute read

In mid-April I was at the Linley Processor Conference. As usual, Linley Gwennap gave the opening keynote. He titled it "How Well Does Your Processor Support AI?", which was the perfect straight-man question for us, since we were using the conference to announce our new processor, the Tensilica Vision Q6 DSP. My blog post that morning, "A New Era Needs a New Architecture: The Tensilica Vision Q6 DSP", covered the details, and Cadence's Lazaar Louis drew the short straw and presented it in the very last session on the second day of the conference. The Q6 is a new architecture with a longer pipeline, branch prediction, and other improvements. Compared to its predecessor, the Vision P6 DSP, it is the same area, 50% faster, and 25% more power efficient.

Training in the Datacenter

Linley's agenda looked at AI in various places, since the processor requirements are very different: in the datacenter, vehicles, client computing, and IoT. Then he looked at some other non-processor IP for AI.

If there is one big trend in AI, it is to use neural networks to the virtual exclusion of other techniques. The main way that inference is done is to pre-process the data, upload to the cloud, and do the inference in the datacenter. Alexa and Siri, for example, work that way. The detection of the trigger word ("Alexa", "Hey Siri", "Okay Google") is done on the device, but the natural language processing is done in the cloud.

However, that is changing. In the future, the datacenter will be used to train the networks, but an increasing amount of the inference will be done on the edge devices. It's not a big deal if Alexa takes an extra tenth of a second to tell you tomorrow's weather, but if your car needs to decide whether a traffic light is red or green then it has to be done without delay. And that's before worrying about whether the network goes down, or the datacenter servers are overloaded.

Most AI research has been centered on datacenters since that is where the bulk of computing power is to be found. Furthermore, a lot of the training is done using NVIDIA GPUs. Amazon's AWS has instances with up to 8 Tesla GPUs. So all the training is done with 32-bit floating point (since that's what you get with a GPU). However, research over the last few years has found that inference really does not need that much precision. It is almost surprising how little reduction in inference accuracy comes from using 8-bit (or even fewer) data and weights.
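To get a feel for why 8 bits can be enough, here is a minimal sketch of my own (not something from Linley's talk). It assumes the simplest scheme I know of, symmetric linear quantization with one scale factor per vector, and the names quantize and quantized_dot are purely illustrative. It compares a floating-point dot product, the basic operation inside a neural network layer, against the same dot product done with 8-bit integers and a single rescale at the end.

```python
import random


def quantize(values, bits=8):
    """Symmetric linear quantization of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quantized_dot(a, b, bits=8):
    """Integer multiply-accumulate, followed by one floating-point rescale."""
    qa, scale_a = quantize(a, bits)
    qb, scale_b = quantize(b, bits)
    return dot(qa, qb) * scale_a * scale_b


# Made-up data standing in for one neuron's weights and inputs.
rng = random.Random(0)
weights = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
inputs = [rng.uniform(0.0, 1.0) for _ in range(1024)]

exact = dot(weights, inputs)
approx = quantized_dot(weights, inputs)
abs_error = abs(exact - approx)
print(f"float: {exact:.4f}  int8: {approx:.4f}  abs error: {abs_error:.4f}")
```

The rounding errors on individual values are small and mostly cancel when summed, which is the intuition behind the result. Real deployments use more careful schemes (per-channel scales, calibration, sometimes retraining), so treat this as a toy.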

NVIDIA dominates training in the datacenter. They reported $1.8B in datacenter revenue last year. The Tesla V100 "Volta" has 80 cores x 4 warps x 32 single-precision operations x 1.37GHz, making 14 teraflops. Its four HBM2 stacks deliver 900GB/s of DRAM bandwidth.
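The peak-throughput figure is just a product of the numbers quoted above, so it is easy to check. The variable names below are mine; the values are the ones from Linley's slide.

```python
# Peak single-precision throughput of the Tesla V100, using the figures above.
cores = 80              # cores (streaming multiprocessors)
warps_per_core = 4      # warps issued per core
sp_ops_per_warp = 32    # single-precision operations per warp per clock
clock_ghz = 1.37        # clock frequency in GHz

peak_gflops = cores * warps_per_core * sp_ops_per_warp * clock_ghz
peak_tflops = peak_gflops / 1000
print(f"{peak_tflops:.1f} teraflops")
```

This is a peak number: it assumes every unit is busy on every clock, which real workloads rarely manage.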

But GPUs being great for neural network training is almost a coincidence. That's not what they were originally designed for. Google designed their own chip, the Tensor Processing Unit (TPU), and have had thousands of them in their datacenters for the last few years. The heart of the design is a systolic array (I don't even know if Linley used the word "heart" as a play-on-words since systolic is a heartbeat) with 256x256 MACs, and data moving from one row to the next on the heartbeat. Unlike GPUs, it doesn't use 32-bit floating point, just 8-bit, which simplifies the implementation (and keeps the power down—the Intel CPU that drives the TPU consumes more power than the TPU).
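If you have not come across a systolic array before, the following toy simulation may help. It is my own sketch, not Google's design: I am assuming a weight-stationary dataflow, where each cell holds one weight, inputs march one cell to the right on each heartbeat, and partial sums march one row down. The function name systolic_matmul is made up for illustration, and the array here is tiny rather than 256x256 (65,536 MACs).

```python
def systolic_matmul(x, w):
    """Simulate a weight-stationary systolic array computing x @ w.

    x is M x K (inputs), w is K x N (weights). Cell (i, j) holds w[i][j].
    """
    m_rows, k, n = len(x), len(w), len(w[0])
    act = [[0] * n for _ in range(k)]    # input value currently in each cell
    psum = [[0] * n for _ in range(k)]   # partial sum leaving each cell
    y = [[0] * n for _ in range(m_rows)]

    # One loop iteration is one heartbeat of the array.
    for t in range(m_rows + k + n - 2):
        new_act = [[0] * n for _ in range(k)]
        new_psum = [[0] * n for _ in range(k)]
        for i in range(k):
            for j in range(n):
                if j == 0:
                    # Left edge: feed x[m][i], with row i delayed by i beats.
                    m = t - i
                    new_act[i][j] = x[m][i] if 0 <= m < m_rows else 0
                else:
                    # Otherwise take the value from the neighbor on the left.
                    new_act[i][j] = act[i][j - 1]
                # Partial sum arrives from the row above (zero at the top).
                above = psum[i - 1][j] if i > 0 else 0
                new_psum[i][j] = above + new_act[i][j] * w[i][j]
        act, psum = new_act, new_psum

        # Finished results drop out of the bottom row, one column at a time.
        for j in range(n):
            m = t - (k - 1) - j
            if 0 <= m < m_rows:
                y[m][j] = psum[k - 1][j]
    return y


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

The point to notice is that no cell ever talks to memory: each one only multiplies, adds, and hands values to its neighbors, which is a large part of why the approach is so power efficient.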

Microsoft has been using FPGAs for some time, initially to speed up search but recently to give 2-3X improvement in neural network performance. Just a few days ago, Intel announced that some of its customers are now rolling out servers with Altera (well, officially Intel PSG) FPGAs already in them, hoping to kick-start an application ecosystem.

Inference at the Edge

One of the biggest edge devices, both in physical size and in market size, is the automobile. Current cars have ADAS features, such as lane departure warnings and automatic emergency braking. Some are beginning to have more advanced features, closer to self-driving in the least demanding situations such as freeway driving or stop-and-go traffic. The electronics are expensive, so like most features in cars, capabilities will first appear at the luxury end where the price can support deployment (and in commercial vehicles). Linley's prediction is that the cost of level 3 autonomous driving will drop below $5,000 by 2022, which will drive adoption since the savings in insurance will perhaps be more than the increase in price, making it a no-brainer. True level 5, where the cars can handle everything and don't even need a steering wheel or other means for a human to take over, may be 10 or more years away. After all, even automated train systems like London's Docklands Light Railway (DLR) still have a (hidden) control panel for manual control of the train, and since the vehicles are restricted to running on rails, it is a considerably simpler problem.

But it's not just transportation where the inference is moving to the edge. Most high-end smartphones now have an AI accelerator block, such as the neural engine inside the Apple A11 Bionic application processor, or the neural engine from DeePhi that is inside the Galaxy S9 application processor, or the MediaTek P60, which contains our Tensilica Vision P6 DSP (I wonder if their next chip will be called the Q60?). These blocks will move into mainstream phones in the coming year or two. Linley feels AI processors may show up in PCs too. I'm not so sure, since increasingly the world is mobile, and PCs are there for when you need a bigger screen and keyboard.

The other edge devices that are getting AI inside are IoT devices, such as voice assistants ("Alexa, what's a voice assistant?") and smart security cameras that can filter out the uninteresting data (almost all of it) and only upload to the cloud when something, such as a visitor, needs more processing power to identify. Drones require some level of vision processing and AI to avoid obstacles, or to follow a skier or cyclist. There are also "tiny" engines for IoT from a number of startups. Consumer IoT is very price sensitive, and its devices to date have been nice-to-have only, so it is not likely to take off as fast as IIoT, the Industrial Internet of Things. There, the business case is easier to make once deployment is cheaper than continuing to do things the old way, so everything is getting "smart": smart meters, smart parking lots, asset tracking, and more.

One interesting development is NVIDIA offering an open-source neural network accelerator based on Xavier. It is too recent to guess whether it will have a major impact on the market. Linley didn't seem to know much about it yet (and I know even less for now). But it seems to be in the 2 teraMAC/s range, although that is almost meaningless without knowing the silicon area.

Next Linley Conference

Trick or treat? The Linley Fall Processor Conference is on Halloween (and All Saints Day, November 1st). No details yet beyond the date, but they should appear on the Linley events page.
