Vision is everywhere. If you look at the number of sensors shipped, vision appears somewhere in the middle (the red bar in the middle of each column on the left-hand graph below). But if you look at the amount of data generated, vision dwarfs everything else. Yes, that really is the correct graph: vision is so much bigger than everything else that the graph is all red. All vision, all the time.

It takes a corresponding amount of computing power to process all that data. For some applications, that processing can be done in the cloud, but often there are real-time requirements that mean it has to be done in the "edge node," be that a smartphone, an automobile, or something else. The biggest challenge of computing at the edge is getting enough processing power without exceeding the allowable power envelope. This is more of a problem in smartphones and wearables than in automobiles, but even so, you are not going to have a cloud datacenter in the trunk of your car.
Vision processing algorithms fall into a class where they need to be "trained" rather than "programmed." There is no way for a programmer to sit down and work out an algorithm for recognizing faces or detecting traffic signs in a scene, in the way they might program an algorithm to deploy an airbag when the sensors detect excessive G-force. Instead, a very general recognition engine based on neural networks is built and then trained on a large number of examples, such as faces or traffic signs. Typically the training requires an enormous amount of computing time, on the order of 10^16 to 10^22 multiply-accumulates (MACs) per dataset, and is done in the cloud or on a large cohort of servers. Then the results of that learning are distilled into a compact representation, which is uploaded into the edge node. The compact representation may still require millions to billions of weights for the neural networks. Not for nothing is this overall approach referred to as "big data."
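To make the idea of a "compact representation" concrete, here is a minimal sketch of one common way trained weights are shrunk for an edge device: quantizing 32-bit floats to 8-bit integers plus a per-tensor scale factor. The function names and the uniform-quantization scheme are illustrative assumptions, not a description of Cadence's actual toolchain.

```python
# Hypothetical sketch: shrink trained float32 weights to int8 + scale.
import numpy as np

def quantize(weights):
    """Map float32 weights into int8 with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights on the edge device."""
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)  # stand-in trained weights
q, scale = quantize(w)
w2 = dequantize(q, scale)
print(q.nbytes, w.nbytes)             # 1000 vs 4000 bytes: 4x smaller
print(np.abs(w - w2).max() <= scale)  # rounding error within one step
```

Real deployments use more elaborate schemes (per-channel scales, retraining after quantization), but the payoff is the same: a representation small enough, and cheap enough to compute with, to fit an edge node's memory and power budget.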
The question then is what to use in the edge node to handle the recognition. This still requires 10^8 to 10^12 MACs per image, way beyond what can be done on a general-purpose processor in a reasonable time and with reasonable power.
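A back-of-envelope calculation shows where counts like 10^8 to 10^12 MACs per image come from: every output pixel of a convolution layer needs one MAC per input channel per kernel tap. The layer shapes below are hypothetical, not those of any particular network.

```python
# Rough MAC count for a small hypothetical convolutional network.

def conv_macs(out_h, out_w, out_ch, in_ch, k):
    """MACs for one conv layer: each of out_h*out_w*out_ch outputs
    needs k*k*in_ch multiply-accumulates."""
    return out_h * out_w * out_ch * in_ch * k * k

# A modest network on a 224x224 RGB image (hypothetical shapes):
layers = [
    conv_macs(224, 224, 64, 3, 3),    # first conv, 3 input channels
    conv_macs(112, 112, 128, 64, 3),  # after 2x downsampling
    conv_macs(56, 56, 256, 128, 3),
    conv_macs(28, 28, 512, 256, 3),
]
total = sum(layers)
print(f"MACs per image: {total:.2e}")  # ~2.9e9, i.e. order 10^9
```

Even this modest example lands around 10^9 MACs for a single image; at video frame rates, a general-purpose CPU core simply cannot keep up within an edge-node power budget.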
Today, Cadence is announcing the Tensilica Vision P6 DSP, targeting just these embedded neural network applications. It quadruples peak neural network performance compared to the previous generation of Vision DSP. It is aimed at the mobile, surveillance, automotive, drone, and wearable markets, and is built on the market-leading Vision P5 DSP announced last year.
The diagram above shows the basic steps in object recognition in an embedded system. The image is cleaned up, and candidate regions are fed to the neural network to perform the actual recognition. Using a specialized DSP like the Tensilica Vision P6 gives the best tradeoff between power/performance and flexibility. Designing special-purpose hardware at the RTL level can achieve even better raw performance, but it has a long and expensive development cycle and is completely inflexible if changes need to be made. A pure software solution, or even a solution using a GPU, has flexibility but at the cost of very high power and, probably, inadequate performance.
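The recognition flow in the diagram can be sketched in a few lines: preprocess the frame, propose candidate regions, then classify each region with the trained network. Every function body here is a hypothetical stand-in (the real pipeline runs optimized kernels on the DSP); only the structure of the flow is the point.

```python
# Sketch of the object-recognition pipeline: cleanup -> regions -> classify.
import numpy as np

def preprocess(frame):
    # Image cleanup stand-in: normalize to [0, 1]. Real pipelines also
    # do denoising, white balance, HDR merging, etc.
    return frame.astype(np.float32) / 255.0

def propose_regions(frame, size=32, stride=32):
    # Crude sliding-window proposals; real systems use smarter detectors.
    h, w = frame.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield frame[y:y + size, x:x + size]

def classify(region):
    # Stand-in for the trained neural network running on the DSP.
    return "object" if region.mean() > 0.5 else "background"

frame = np.random.rand(64, 64, 3) * 255   # stand-in camera frame
clean = preprocess(frame)
labels = [classify(r) for r in propose_regions(clean)]
print(labels)  # one label per candidate region
```

The expensive step is `classify`, which is where the neural network MACs are spent; that is the part the Vision P6's wide SIMD MAC arrays are designed to accelerate.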
The Vision P6 DSP announced today has 4X the neural network performance of the Vision P5, 5X the performance on imaging and vision benchmarks, 4X the number of MACs, and a 32-way, 16-bit SIMD vector FPU that makes for easy porting of GPU code. The block diagram is shown above. In addition to the semiconductor IP itself, there is a rich software ecosystem for imaging and vision, with a wide range of partners supplying libraries for ADAS, facial detection, HDR photography, image stabilization, and more.
The summary is that neural networks have been chosen for the next generation of many computer vision applications, but this requires on-chip solutions with high computational capacity and low energy consumption. The Vision P6 DSP is the best solution available today.
Carl Jung said that "your vision will become clear only when you can look into your own heart." But then he didn't know anything about image processing and neural networks.