The Second Embedded Neural Network Symposium

13 Feb 2017 • 8 minute read

A couple of weeks ago, Cadence held the second embedded neural network symposium (ENNS). Neural networks are a hot topic. In fact, this was the biggest event ever held on the Cadence campus with over 300 people registered (I think around 200 showed up). The second largest event was... last year's ENNS.

As I said, neural networks are a hot topic at the moment, for things like machine translation and automatic analysis of photos. These applications run in the cloud. But embedded neural networks, meaning neural networks that run without doing datacenter-based analysis, are an important area of research and development. One of the big drivers, but by no means the only one, is autonomous driving (and its forerunner, ADAS), which requires a lot of analysis to be done in the vehicle without relying on its uplink to a cloud backend. When pedestrians step into the road in front of a vehicle, only the on-board processing is going to stop the car safely.

The computing devices used for embedded neural networks range from IoT edge nodes, through cell-phones, up to vehicles. Obviously, there is a big difference between the amount of power available to an IoT edge device and a vehicle (especially in commercial vehicles), but all these applications are characterized by computation limited by power and cost.

Chris Rowen was the master of ceremonies for the day. Until recently he was the CTO of Cadence's IP division, and prior to that, he was the founder and CTO of Tensilica. He left Cadence to create a new company, Cognite Ventures. Inviting Chris back wasn't as weird as it seems since this wasn't really a full-on Cadence event. It was held on the Cadence campus, but only one of the speakers, Samer Hijazi, was from Cadence. The rest were from a wide range of companies and universities (well, one, Stanford).

Chris pointed out that the subtitle of the conference, Deep Learning: the New Moore's Law, wasn't meant to imply that something in neural networks was doubling every year. But just as Moore's Law has driven a whole gamut of semiconductor applications that nobody would have thought of forty years ago, so deep learning will drive a whole range of applications beyond the ones that were on the agenda at the symposium. Several speakers pointed out that there is a lot of hype around neural networks, but while the hype is ahead of reality, there really is a solid reality coming up behind.

The really big change that neural networks bring about is programming by example. Instead of developing complex and fragile algorithms to recognize, say, faces, the algorithms are trained on a huge dataset. This is especially the case with vision, followed by machine translation, in terms of very active development. This approach—training followed by running the algorithm on a neural network—is becoming the primary mechanism for dealing with the most complex problems. The best way to sum this up is in the phrase "training is the new programming."

The day opened with Jeff Bier talking about embedded vision. I'll cover his talk in its own post later in the week. The keynote was by Kunle Olukotun, who is the Cadence-sponsored professor at Stanford (I'm sure there is a more official title). I was chatting with him before the symposium started and asked him how come he had such a strong English accent. He was actually born in London, moved to Nigeria when he was 9 (having got the accent), and then came to the US to go to college when he was 18 or so. I will cover his talk in its own post later in the week.

Kai Yu, Horizon Robotics

Next up was Kai Yu, who is the founder and CEO of Horizon Robotics. That sounds like the typical name for a Silicon Valley startup, but in fact, they are based in Beijing. Before founding Horizon, he was something senior in Baidu's deep learning developments. As everywhere, the main driver for this business is automotive. China is now the #1 market for cars, but it is also the #1 market for automotive fatalities. Another business driver in China is security cameras. As Kai put it, "Nobody in China cares about privacy," so cameras will have embedded vision within a few years. Horizon is focused on three verticals: automotive, smart life, and public safety. He said that Horizon counts as a mature company since their average age is 32 and most have overseas experience, compared to his team at Baidu (or maybe it was the whole company?), where the average age was 25. The opportunity here is huge since by 2040 there will be more robots than humans.

Samer Hijazi, Cadence

Samer took a look at what it will take for deep neural networks (DNN) to be successfully implemented in embedded devices. The state-of-the-art in DNN implementation is 40W/TMAC (watts per tera-multiply-accumulate), and a typical application requires 4 TMAC, meaning 160W. So even a 100X improvement in hardware efficiency isn't enough. Current implementations are simply performing too many MACs per pixel (or equivalent in other domains). There are four ways to reduce this:

Optimize the network architecture (minimize multiplies per pixel)
Optimize the problem definition (minimize the number of pixels to be processed)
Minimize the number of bits in the representations (make each multiply cheaper)
Utilize optimized hardware for CNN (reduce the power at the silicon level)
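
To make the arithmetic behind those numbers concrete, here is a small sketch. The 40W/TMAC and 4 TMAC figures are the ones quoted above; the function name and the hardware_gain parameter are my own, purely for illustration.

```python
# Worked arithmetic for the power figures quoted in the talk.
WATTS_PER_TMAC = 40.0   # state-of-the-art efficiency quoted above
REQUIRED_TMAC = 4.0     # typical application workload quoted above


def power_watts(tmac, watts_per_tmac, hardware_gain=1.0):
    """Power for a workload; hardware_gain is a hypothetical efficiency factor."""
    return tmac * watts_per_tmac / hardware_gain


baseline = power_watts(REQUIRED_TMAC, WATTS_PER_TMAC)
with_100x = power_watts(REQUIRED_TMAC, WATTS_PER_TMAC, hardware_gain=100.0)

print(f"baseline: {baseline:.1f} W")
print(f"with 100X better hardware: {with_100x:.1f} W")
```

Even with the 100X hardware gain, the result is still more than a watt, which is why the first three items on the list attack the number of MACs and the cost of each one rather than relying on silicon alone.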

One approach to optimizing the network architecture is to start with a cloud-based solution, but that is obviously overkill for embedded. It turns out that by dialing back various aspects of the net, it is possible to keep the performance at almost the same level with considerably fewer resources (power and cost) (see the diagram above). This process can be automated. Cadence has been doing work with a generic superset network architecture called CactusNet, where the network architecture can be incrementally optimized and analyzed for sensitivity. Very good results have been obtained with the German traffic sign database (which is a sort of best-case for vision recognition, because there are a limited number of signs and they have been designed to be easily recognized in poor visibility conditions). CactusNet can get the same recognition rate as other leaders, but with two orders of magnitude lower complexity.

One area that is surprising is how much the precision can be reduced without affecting performance. In the cloud, neural nets are typically using 32-bit floating point. In embedded applications, it often turns out that 8- or even 4-bit precision gives similar results. Neural networks are analyzing noisy data, and reducing precision is, in one sense, just adding more noise. A little more complexity in the network can handle all the noise, both in the data and the quantization noise.
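
The "quantization is just more noise" idea can be illustrated with a minimal sketch. It assumes simple symmetric uniform quantization of some randomly generated stand-in weights. This is not the scheme any speaker described, and the function names are mine.

```python
import random


def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels and back."""
    levels = 2 ** (bits - 1) - 1                   # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5


random.seed(0)
# Stand-in "weights": made-up Gaussian values, not from a trained network.
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

for bits in (8, 4):
    err = rms_error(weights, quantize(weights, bits))
    print(f"{bits}-bit RMS quantization error: {err:.4f}")
```

The error printed for each bit width is the extra "noise" the network has to tolerate. Whether accuracy actually survives depends on the network and the data, which this toy example does not measure.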

When it comes to optimized hardware, the Cadence solution is the Tensilica Vision P6 DSP. This is a CNN processor optimized for:

Minimizing pJ/MAC
Minimizing data movement (which is power-hungry)
Having enough MAC/sec for the application
Keeping the hardware resources highly utilized

Song Han, Stanford

Next up was Song Han, who is a doctoral candidate at Stanford. Chris was fulsome in his praise introducing him, pointing out that Song had led the way on many efficiency optimizations, significantly more aggressive than people thought was possible. As if that weren't enough, he has also designed application-specific hardware for inference.

His approach is to compress the sparse neural network, then quantize and reduce the bit widths. This compressed dataset of weights is then transmitted to the embedded neural network and specialized hardware is then used to perform the recognition. In the limit, some of the quantization goes down to two bits, representing positive, zero and negative.
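
As a rough sketch of that pipeline, the example below zeroes small weights, maps the rest to three values, and then keeps only the non-zero entries for transmission. Magnitude thresholding and the specific threshold are my assumptions for illustration. Song's actual method is more sophisticated than this, and none of these function names come from his work.

```python
def prune(weights, threshold):
    """Zero out small-magnitude weights, leaving a sparse list."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def ternarize(weights):
    """Map each weight to -1, 0, or +1, which fits in two bits."""
    return [(w > 0) - (w < 0) for w in weights]


def to_sparse(codes):
    """Keep only (index, code) pairs for the non-zero entries."""
    return [(i, c) for i, c in enumerate(codes) if c != 0]


# Made-up weights and threshold, for illustration only.
weights = [0.82, -0.03, 0.01, -0.67, 0.05, 0.44, -0.02, -0.91]
codes = ternarize(prune(weights, threshold=0.1))

print(codes)             # [1, 0, 0, -1, 0, 1, 0, -1]
print(to_sparse(codes))  # [(0, 1), (3, -1), (5, 1), (7, -1)]
```

Half of the entries vanish in this toy case, and each survivor needs only a sign, which is where the savings in storage and transmission come from.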

Chris Rowen followed with his own presentation, which I'll cover in a separate post later in the week.

Forrest Iandola, DeepScale

Next, another startup. DeepScale has its roots in ten years of work at Berkeley. It has some EDA roots; the team includes Kurt Keutzer, who used to be CTO at Synopsys before becoming a Berkeley professor, and Don MacMillen (the one who isn't a comedian), who also worked at Synopsys for many years, having been at VLSI Technology for some time before that (like the Don McMillen who is a comedian, who was an IC designer at VLSI).

Forrest talked about FireNet, a neural network architecture with few weights, built out of 1x1 and 3x3 modules. The motivation is that 3x3 modules contain nine times the weights and nine times the computation of 1x1. FireNet is built out of Fire modules (see diagram on right). Their latest architecture, still built out of Fire modules, is called SqueezeNet.
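
The factor of nine is easy to check with a quick sketch. The layer shape below is hypothetical, and I am assuming stride 1 with padding so that both filter sizes produce the same output size. Biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution layer (no bias)."""
    return kernel * kernel * in_channels * out_channels


def conv_macs(kernel, in_channels, out_channels, height, width):
    """Multiply-accumulates needed to produce a height x width output map."""
    return conv_weights(kernel, in_channels, out_channels) * height * width


# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, h, w = 64, 64, 56, 56

w1 = conv_weights(1, c_in, c_out)
w3 = conv_weights(3, c_in, c_out)
m1 = conv_macs(1, c_in, c_out, h, w)
m3 = conv_macs(3, c_in, c_out, h, w)

print(f"1x1: {w1} weights, {m1} MACs")
print(f"3x3: {w3} weights, {m3} MACs")
print(f"ratio: {w3 // w1}x weights, {m3 // m1}x MACs")
```

The ratio is nine regardless of the channel counts or image size chosen, so every 3x3 filter that can be swapped for a 1x1 filter saves both storage and computation.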

Ren Wu, NovuMind

The last presentation of the day, since Anshu Arya got delayed, was by Ren Wu of NovuMind. He talked about how computational power is very important. Both the defeat of Kasparov at chess and AlphaGo's victories at Go relied a lot on sheer compute power and not just clever algorithms (although I have heard the opposite view, that AlphaGo could be run on twenty-year-old hardware; we just didn't know how to do it back then). Ren talked about the standard two-phase approach, using supercomputers for training and then deploying the resulting models in a wide range of devices.

The focus of a lot of implementation is on dedicated hardware. A general-purpose CPU has a certain level of performance, GPUs and DSPs are higher still, and highest of all is dedicated hardware. (Note that this is a logarithmic scale, so in fact, there is about a 1000X difference in efficiency between general-purpose CPUs and dedicated hardware!)

Coming Soon to a Blog Near You...

Over the next few days I will write posts about three presentations that I picked out, which I feel are of general interest outside neural network specialists:

Jeff Bier, of the Embedded Vision Alliance (and other things), talking on When Every Device Can See
Kunle Olukotun, of Stanford University, on Scaling Machine Learning Performance with Moore's Law
Chris Rowen, of Cognite Ventures, on Neural Networks: The New Moore's Law

The week will wrap up with a look at the panel session and a look to the future of neural networks.