Breakfast Bytes
Paul McLellan

Power-Efficient Recognition Systems for Embedded Applications

12 Jul 2016 • 4 minute read

Neural networks are hot. Las Vegas is hot, too. And there is a connection. In late June, one of the major conferences in the field, Computer Vision and Pattern Recognition (CVPR), was held there. On the Sunday before, Cadence ran a half-day training course on Power-Efficient Recognition Systems for Embedded Applications, and I attended it. A trip to Vegas, yeah. Spending all day in a windowless conference room, not so much.

But the whole area is changing really fast, with new developments coming all the time. Here are three things that I saw in just the last couple of days:

  • At Apple’s Worldwide Developer Conference (WWDC), they announced two neural network APIs, called Basic Neural Network Subroutines (BNNS) and Convolutional Neural Networks (CNN). This is for the Mac, not the iPhone.
  • Popular Science carried a story about a pilot AI program (as in flying planes, not as in a pilot project) developed at the University of Cincinnati that shot down USAF Colonel Gene Lee, who has decades of experience, in a series of flight combat simulations. Every single time.
  • And a less successful one: Tesla announced the first fatality in a vehicle running Autopilot. “Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied.” On the other hand, that is the first fatality in 130 million miles of driving with Autopilot active, which is a bit better than humans manage. Tesla's data also shows that, on average, cars under Autopilot have half as many "airbag deployment" accidents as cars under manual control.

A little earlier in the year, you probably heard about Google’s AlphaGo program beating the world Go champion. Lou Scheffer pointed out in his keynote at DAC that this is almost entirely due to better understanding of these sorts of algorithms, and not due to the hardware getting much faster. As he said, “we could have done it on 1995 hardware if we knew how to.”

You can see from these examples one big reason that deep learning is getting so much buzz: the results are better than those of conventional hand-written algorithms. As a result, the focus of recognition has moved from “programming” to the availability of large datasets in the cloud that can be used for “training.” In many cases, results now exceed human recognition rates.

Most neural network research is done in the cloud, which in practice often means running the code on NVIDIA GPUs, such as the recently announced Tesla, using CUDA to program them. But the focus of the training in Vegas was on embedded systems.

Chris Rowen, Cadence's CTO for the IP Group, opened the tutorial by saying that it was really about the fundamental issues in building optimized silicon to proliferate recognition and deploy autonomy. In a sense, it was a tutorial about efficiency.

The amount of computation involved can be quite high. Training is typically done in the cloud, once per dataset, to generate all the coefficients for the neural net, perhaps 10^16 to 10^22 MACs per dataset. Then the recognition is more like 10^6 to 10^12 MACs per image. But if we are going to do that in an embedded system, then we need to find ways to do it very efficiently. NVIDIA GPUs use 32-bit floating point and can dissipate as much as 250W.
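To put those exponents in perspective, here is a back-of-the-envelope MAC count for a single convolutional layer. The layer dimensions are hypothetical, chosen only for illustration, not taken from the tutorial.

```python
# Back-of-the-envelope MAC count for one convolutional layer.
# MACs = output pixels x output channels x kernel area x input channels.

def conv_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    return out_h * out_w * out_ch * in_ch * k_h * k_w

# Hypothetical layer: 3x3 kernels, 64 input channels, 224x224x64 output.
macs = conv_macs(224, 224, 64, 64, 3, 3)
print(f"{macs:.3e} MACs")  # roughly 1.85e9 for this one layer
```

Summing a few dozen layers of this scale puts a single-image inference comfortably inside the 10^6 to 10^12 range quoted above.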

Even in a vehicle, that is impractically high. Chris said that he had talked to someone from one of the Big 3 US automotive companies who was worried that the recognition systems would dissipate kilowatts, ending up needing a similar amount of power to that required for traction.

Embedded systems are not like datacenters. People have different care-abouts. First, the cost should be zero, or at least not much, since high-volume consumer devices are very price sensitive. Next, zero power, or at least not much: most of these devices are battery powered or in environments where only passive cooling is available (no fans, for example). Then, high enough performance for whatever the device has to do; for a recognition system running CNNs, that is a lot. Plus, of course, small size, no bugs, a short design schedule, and so on. It is a software-hardware co-optimization problem, not just writing software algorithms and running them on general-purpose hardware.

There is a spectrum of implementation fabrics from server GPUs to FPGAs to embedded GPUs. But the most promising are specialized vision DSPs and special CNN DSPs.

The challenge is how to push up the recognition rate while pushing down the resources required. These sound incompatible, but in fact academia and industry are finding ways to improve both at the same time. One well-known benchmark is the German traffic sign recognition benchmark, and every year the record improves (see the diagram).

The first thing to change is to stop insisting on 32-bit floating point and instead explore different number representations on different types of hardware platform. Sometimes really good results can be obtained with 4-bit fixed point, and then there can be up to a 50X memory footprint reduction.
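As a toy illustration of what 4-bit fixed point looks like, here is a minimal symmetric per-tensor quantizer. The scaling scheme and the sample weights are assumptions for the sketch, not the method used by any particular tool.

```python
# Sketch: symmetric 4-bit fixed-point quantization with one shared scale.
# Signed 4-bit codes cover [-8, 7]; each code replaces a 32-bit float.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7.0  # 7 = largest positive code
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.91, -0.42, 0.07, -0.88, 0.33]
codes, scale = quantize_4bit(weights)
approx = dequantize(codes, scale)  # each value within scale/2 of the original
```

Precision alone buys 8X (4 bits versus 32 bits); larger reductions like the 50X quoted above presumably also involve pruning and compressing the network itself.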

This type of optimization can be done by hand, but to get the best results there needs to be some new automated optimization, something closer to synthesis than to programming. Then we can create compelling embedded real-time applications that fit in a doorknob, a car, or a phone, and make the transition from the server-based world to the embedded one.

One challenge is that neural networks are not well understood by embedded architects, and embedded systems are not well understood by neural network experts. The rest of the day, which will be covered in posts later this week, was a step towards fixing that.

Next: How to Optimize Your CNN

Previous: Last Chance to See Tsukiji Fish Market