There is a widening split between academic attempts to build deep neural networks (DNNs) with ever more powerful recognition, and implementation approaches that aim to be practical. The academics are largely interested in publications and prestige ("world's best recognition results") with little regard for the resources required ("ten thousand servers"). Practical engineers, on the other hand, are aiming for a different tradeoff: recognition very close to what can be achieved with unlimited resources, but within a realistic resource budget, because they want to build a real product. The most extreme cases are embedded approaches, where the network has to be implemented on-chip within a very limited power budget.
The challenge is immense. State-of-the-art hardware today consumes about 40W per TMAC/s, and 4 TMAC/s are needed for many real-time DNN applications such as glass surround analysis or gesture HMI. That means 160W (a big light bulb, not something you are going to put in your pocket or purse). Even reducing it by a couple of orders of magnitude isn't really enough. Embedded power budgets and form factors simply cannot accommodate the current trend in DNNs.
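The arithmetic is easy to check for yourself (a minimal sketch; the 40W/TMAC and 4 TMAC/s figures are the ones quoted in the talk):

```python
# Back-of-the-envelope power estimate from the figures quoted above.
watts_per_tmac = 40.0   # state-of-the-art hardware: ~40 W per TMAC/s
tmacs_needed = 4.0      # real-time DNN workloads: ~4 TMAC/s

total_watts = watts_per_tmac * tmacs_needed
print(total_watts)        # 160.0 -- light-bulb territory, not pocket territory

# Even a 100x (two orders of magnitude) improvement still leaves 1.6 W,
# at the top end of a typical embedded power budget.
print(total_watts / 100)  # 1.6
```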
At the recent Autosens conference in Detroit, Cadence's Michelle (Xuehong) Mao presented on this problem in a talk titled What Will It Take to Bring DNN to Embedded?
The fundamental problem, Michelle said, is that DNNs use too many multiplies and data moves per pixel, and every one of them consumes power (and time). To address the problem, there are four things that can be done:
Here at Cadence, we are not in the business of designing DNNs for real-world applications; that's what our customers do. However, we very much are in the business of supplying neural network processors and enabling the design of hardware for DNN implementation. To drive our technology, we created CactusNet, a DNN that is effectively a superset of the well-known, widely available DNNs used for development: Inception, ResNet, GoogLeNet, AlexNet, and the like. It is highly parameterizable, and depending on the parameters chosen, it can implement any of the DNNs just mentioned. It is a general DNN reference architecture with lots of control knobs.
The purpose of CactusNet is twofold. First, it gives Cadence a widely applicable DNN to use when driving its neural network processor architectures, ensuring that they can efficiently implement a wide range of DNNs rather than over-optimizing for any particular one. Second, CactusNet can be used during the optimization phase described in the next section. Since it is the "one ring to rule them all" of DNNs, the optimization approach generalizes to any of the common widely available DNNs.
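To make "control knobs" concrete, here is a hypothetical sketch (not Cadence's actual CactusNet code) of how a convolutional layer's multiply count follows directly from a handful of parameters — exactly the kind of knobs a parameterizable superset network exposes:

```python
def conv_macs(out_h, out_w, out_channels, in_channels, kernel_h, kernel_w):
    """Multiply-accumulates for one convolutional layer: every output
    pixel in every output channel needs a kernel_h x kernel_w x
    in_channels dot product. (Illustrative helper, not CactusNet.)"""
    return out_h * out_w * out_channels * in_channels * kernel_h * kernel_w

# Example: a 224x224 feature map, 64 output channels, 3 input channels,
# 3x3 kernels -- early layers of AlexNet/VGG-class nets are in this ballpark.
macs = conv_macs(224, 224, 64, 3, 3, 3)
print(macs)  # 86704128 multiplies for a single layer, single frame
```

Turning any of those six knobs down reduces the multiply count multiplicatively, which is why a parameterized reference architecture is a natural vehicle for trading complexity against accuracy.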
CactusNet is not some lowest-common-denominator implementation; it is state of the art. On the German Traffic Sign Recognition Benchmark (GTSRB), a benchmark widely used in automotive, CactusNet has the best recognition performance of any known network. The surprising part is that it achieves this with at least 30X less complexity than the other nets, since it was designed from the beginning to be optimized for embedded applications.
To some extent, the more complex the network architecture, the better the recognition rate; otherwise, there would be no point in making the network more complex. But diminishing returns set in, and huge increases in network complexity are required to get tiny improvements in recognition rate. That works in your favor when you try to reduce the complexity of a network to target an embedded solution: you can make huge reductions in complexity while giving up only a tiny amount of recognition.
The graph above shows how this works. On the right is the "starter network" in the cloud (the blue dot), way too big and power-hungry to be implemented in a chip. The network is trimmed incrementally and the results measured: that is the sequence of yellow dots moving to the left, still too power-hungry (to the right of the dotted red line marking the power budget). Eventually you reach the green dots that meet the power budget. If you go further still, the recognition rate eventually deteriorates unacceptably.
CactusNet is a tool for doing this, since it is a superset network architecture with many knobs to tune. By measuring redundancy versus accuracy, the network can be trimmed and assessed. Both experience and the academic literature indicate that this cannot be done in a single step; the network really does have to be trimmed incrementally. So the sequence is: trim the network, retrain it with part of the data, then assess the recognition rate on the rest. If the trimmed network performs well, it is a candidate for further trimming; if not, it is discarded and a different trim is tested.
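The trim-retrain-assess loop just described can be sketched as follows. This is a hypothetical outline, not Cadence's actual flow: all of the callables (`trim_step`, `retrain`, `evaluate`, `power`) are stand-ins for a real training pipeline and power model.

```python
def optimize_for_budget(network, train_data, test_data,
                        power_budget, accuracy_floor,
                        trim_step, retrain, evaluate, power):
    """Incrementally trim a network until it fits the power budget,
    rejecting any trim that drops recognition below the floor.
    All callables are hypothetical stand-ins for a real pipeline."""
    best = network
    while power(best) > power_budget:
        candidate = trim_step(best)      # remove some redundancy
        retrain(candidate, train_data)   # fine-tune with part of the data
        if evaluate(candidate, test_data) >= accuracy_floor:
            best = candidate             # keep trimming from here
        else:
            break  # this trim cost too much accuracy; try a different trim
    return best
```

The key point the loop captures is that acceptance is decided per step: each trim is only kept if retraining recovers enough accuracy, which is why the reduction has to proceed incrementally rather than in one jump.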
The graph above shows the result of using CactusNet on GTSRB: world-leading performance at a 30X reduction in complexity from the baseline, which is a replica of Sermanet's network.
This is not just a one-off; CactusNet was not specifically designed to optimize this one dataset. The graph above shows ResNet-50, which has the best accuracy-versus-complexity tradeoff on ImageNet (a standard tagged image dataset that is widely used). CactusNet outperforms ResNet-50 on both accuracy and complexity.
Part 2 of the discussion of moving DNNs from cloud to embedded will be tomorrow.