A New Era Needs a New Architecture: The Tensilica Vision Q6 DSP

11 Apr 2018 • 5 minute read

Vision and artificial intelligence (AI) processing are both becoming more sophisticated. There are many drivers of this, but two of the most important are the advanced capabilities of high-end smartphones and the demands of ADAS and autonomous driving.

For smartphones, face detection requires a mixture of vision and AI processing, and the requirements are increasing all the time. In particular, it needs to work under non-ideal conditions: when the face is turned to one side, when the person is wearing a hat, when the face is partially obscured by a scarf, and so on. The cameras in smartphones are getting increasingly sophisticated, using multiple sensors to do things like high-dynamic range (HDR) imaging, hybrid zoom, image stabilization, and more.

The other "driver", ADAS, has two big requirements. One is performance for functions like pedestrian detection, driver attention monitoring, and lane departure warnings. The other is that the power needs to be kept low. Most of the chips are in limited air-flow environments (no fans), and sometimes in extreme temperature environments, such as behind the central cabin mirror, up against the windshield.

There are other drivers too, such as virtual reality, augmented reality, robotics, drones, and surveillance cameras, that have similar characteristics.

These functions, vision and AI, are often combined into a single camera pipeline, starting with things like noise reduction, then doing vision post-processing, and then classification and segmentation of the image. Sometimes the AI is done first, for scene classification, prior to doing sophisticated image processing such as HDR or bokeh (blurring the out-of-focus parts of the image; the Japanese for blur is boke).

Looking at the big picture, there are three things going on:

Need to combine vision and AI processing in a single DSP
Need for increased performance
Aggressive power limits

Together, these three requirements call for a new generation of Tensilica Vision processors.

Introducing the Vision Q6 DSP

Looking at the options for increasing the performance, there are several choices:

Increase the SIMD width or VLIW slots: However, this gets increasingly difficult to program to avoid a lot of idle resources. It is much easier to add processing power than it is to use it.
Multi-core: This would instantly double (or more) the available processing power, but requires twice the local memory, and also suffers from being hard to balance so that all cores are always busy.
Increase frequency: Obviously, this increases performance but at the cost of increased area (cost) and increased power.
Create a new architecture with a fundamentally higher performance.

Cadence decided on the fourth option. The new Vision Q6 DSP is our fifth-generation vision and AI DSP. It has a 13-stage processor pipeline that can achieve a 1.5GHz processor frequency (in 16nm), 50% higher than its predecessor, the Vision P6 DSP, in the same silicon area. It also has 1.25X the power efficiency of the Vision P6 DSP at peak performance. On standard imaging kernels, the performance improvement can be as much as 2X.

The pipeline consists of:

Instruction front-end (3 stages)
Instruction decode and dispatch (2 stages)
AR/scalar integer pipeline (5 stages)
Vector DSP (3 stages)

Loads and stores are handled separately following the first stage of instruction decode. Also, vector execution has been separated from scalar execution, which provides increased scalar performance and the opportunity to add a scalar data cache. This cache can lead to up to 50% improvement in scalar cycle count, and obviously, the slower the memory is, the more advantage the cache delivers (compared to not having a cache).
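To see why slower memory makes the cache more valuable, here is a toy cycle-count model in Python. It is only a sketch: the load fraction, hit rate, and latencies are made-up illustrative values, not Vision Q6 DSP measurements, and the function names are hypothetical.

```python
def scalar_cycles(instructions, load_fraction, mem_latency,
                  hit_rate=0.0, hit_latency=1):
    """Estimate scalar cycles: one cycle per instruction plus load stalls.

    hit_rate=0.0 models a core with no scalar data cache.
    """
    loads = instructions * load_fraction
    # Average cost of a load: hits cost hit_latency, misses cost mem_latency.
    avg_load_cost = hit_rate * hit_latency + (1.0 - hit_rate) * mem_latency
    # Each load already has one issue cycle; only the extra cycles are stalls.
    stalls = loads * (avg_load_cost - 1)
    return instructions + stalls


def cache_improvement(instructions, load_fraction, mem_latency, hit_rate):
    """Fractional reduction in cycle count from adding a cache."""
    without = scalar_cycles(instructions, load_fraction, mem_latency)
    with_cache = scalar_cycles(instructions, load_fraction, mem_latency,
                               hit_rate)
    return (without - with_cache) / without


# Assumed workload: 25% loads, 90% cache hit rate (illustrative only).
for latency in (1, 5, 10):
    gain = cache_improvement(1000, 0.25, latency, 0.9)
    print(f"memory latency {latency:2d} cycles -> {gain:.0%} fewer cycles")
```

In this model, single-cycle memory gets no benefit from the cache, and the gain grows as the memory latency grows, which matches the qualitative point above.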

Another development is the addition of a branch predictor. The deeper the pipeline, the more important this is, since missed predictions require the pipeline to be flushed and re-filled.
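The cost of mispredicted branches can be sketched with a standard back-of-the-envelope model in Python: each misprediction adds a fixed flush penalty to the average cycles per instruction. The branch fraction, misprediction rates, and flush penalties below are assumptions chosen for illustration; the actual Vision Q6 DSP penalty is not stated here.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction including pipeline flushes.

    flush_penalty is the number of cycles lost refilling the pipeline
    after a mispredicted branch (hypothetical value, grows with depth).
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Assumed workload: 20% of instructions are branches (illustrative only).
BRANCH_FRACTION = 0.2

for name, penalty in (("shallow pipeline", 5), ("deep pipeline", 10)):
    poor = effective_cpi(1.0, BRANCH_FRACTION, 0.30, penalty)
    good = effective_cpi(1.0, BRANCH_FRACTION, 0.05, penalty)
    print(f"{name}: CPI {poor:.2f} at 30% mispredicts, "
          f"{good:.2f} at 5% mispredicts")
```

Because the penalty term scales with the flush length, the same drop in misprediction rate saves more cycles on the deeper pipeline.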

The Vision Q6 DSP is backward compatible with the Vision P6 DSP, so any code written for the Vision P6 DSP will run unchanged on the Vision Q6 DSP. However, there are new instructions for the Vision Q6 DSP, so the converse is not true.

The Vision Q6 DSP is designed to work well in multiprocessor environments, using AXI4 as the interconnect. Either multiple Vision Q6 DSPs can be used, or a Vision C5 DSP can be added to the system to partition AI and vision processing.

Programming the Vision Q6 DSP

It is all too easy to design a lot of processing power onto the silicon, and then not be able to access that power from the higher level where the programmers operate. AI is an area where several frameworks are widely used. The Vision Q6 DSP supports:

Android Neural Networks, which enables on-device AI for Android-based platforms, such as non-Apple smartphones
TensorFlow, TensorFlow Lite, Caffe
Custom layer support, for people who want to augment standard networks with a unique capability
Standard neural networks (MobileNet, Inception, ResNet, VGG, SegNet, FCN, YOLO, RCNN, SSD, etc.)

The one that is relatively new in this list is Android Neural Networks (ANN), which was released in October last year, about six months ago. This provides a neural network API that makes it transparent whether the neural network is implemented on the application processor (normally a high-end Arm processor), or on a specialized AI DSP. The Vision Q6 DSP supports ANN for Android 8.1 (Oreo). It provides real-time optimized execution. The diagram on the right shows how the pieces of ANN fit together. The Vision Q6 DSP fits in the middle as a specialized processor (or perhaps on the left as a DSP, it's just a matter of terminology).

[Figure: XNNC tool-chain]

The existing Tensilica AI tool-chain, shown above, is known as XNNC (Xtensa Neural Network Compiler). This takes a neural network descriptor (in Caffe or TensorFlow) and compiles it down to code that will run on the Vision Q6 DSP (or Vision P6 or C5 DSPs). It automatically handles a lot of the housekeeping, such as DMA and tile management. Normally Tensilica AI processing uses 8-bit weights. In the last couple of years, a lot of work has shown that 32-bit floating point and 8-bit fixed point provide essentially the same accuracy (around 0.5% quantization error). Using 8-bit weights instead of 32-bit ones is a huge saving in power and area.
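As a rough illustration of what moving from 32-bit floating point to 8-bit fixed point involves, here is a minimal Python sketch of symmetric per-tensor quantization. This is a generic textbook scheme, not a description of how XNNC quantizes; the function names and the random test weights are assumptions. It measures the error in reconstructing the weights, which is not the same thing as the network accuracy figure quoted above.

```python
import random


def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto the int8 limit 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def relative_error(original, restored):
    """RMS reconstruction error relative to the RMS of the original values."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored))
    den = sum(a ** 2 for a in original)
    return (num / den) ** 0.5


# Hypothetical weights: 1000 samples from a zero-mean Gaussian.
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(f"relative weight error: {relative_error(weights, restored):.4f}")
```

Each weight then occupies one byte instead of four, and every reconstructed value is within half a quantization step of the original.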

For specific algorithms, such as HDR, voice authentication, image stabilization, and so on, Cadence works with a broad range of partners who are experts in these specific areas. We are also the chair of the OpenVX working group at Khronos, driving standards for vision processing offload.

More Details

See the Tensilica Vision processor page.
