A lot of AI design is done in software and, while much of it will remain there, increasing numbers of designs are finding their way into hardware. There are multiple reasons for this, including the important goals of lower power and higher performance for critical parts of the AI process. Imagine, for example, that you need a dramatically improved rate of object recognition in an automated-driving application.
Implementing an AI application in hardware presents some key challenges for the designer.
In this article, I'll describe a flow that starts in the TensorFlow environment, moves into abstract C++ targeted at the Stratus HLS flow, and then into a concrete hardware implementation flow.
We have a completed implementation of the commonly used MNIST digits example, which performs character recognition on images of handwritten digits. Our approach was to first model the recognition algorithm in the TensorFlow framework. This proved easy and productive, and allowed us to train the network and extract the "weights" we would use during inference. The architecture of the network we used is shown below.
The next step was to implement the key TensorFlow functions in abstract C++ and pull all the key pieces together into a macro-level SystemC architecture. As shown below, the C++ code was organized similarly to the TensorFlow code, but with parametrized data types and latency constraints to allow architectural exploration.
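To make the idea of "parametrized data types" concrete, here is a minimal sketch of what one such layer might look like. This is a hypothetical example, not the actual project code: the layer sizes, names, and the use of plain C++ arrays are all illustrative, and in the real flow the template parameter would be an HLS fixed-point type rather than `float`.

```cpp
#include <array>
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: a fully connected layer with ReLU, templated on the
// data type T. The same source can be instantiated with float for the
// accuracy model and with a fixed-point type for hardware exploration.
template <typename T, std::size_t IN, std::size_t OUT>
std::array<T, OUT> dense_relu(const std::array<T, IN>& x,
                              const T (&w)[OUT][IN],
                              const std::array<T, OUT>& b) {
    std::array<T, OUT> y{};
    for (std::size_t o = 0; o < OUT; ++o) {
        T acc = b[o];
        for (std::size_t i = 0; i < IN; ++i)
            acc += w[o][i] * x[i];       // multiply-accumulate
        y[o] = std::max(acc, T(0));      // ReLU activation
    }
    return y;
}
```

Because the data type is a template parameter, exploring a different bit-width means changing one typedef rather than rewriting the algorithm.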
Then we defined a range of exploration parameters. For the data types, the training model used full IEEE floating-point values, but we decided that exploring smaller data types could be an important avenue of investigation. So, in our hardware-targeted model, we used fixed-point data types and varied the width of the internal values from 16-bit down to 12-bit fixed point, stepping by 1 bit. For latency, we settled on three settings indicating FAST, MEDIUM, and SLOW designs. We expected that lower performance would decrease both area and power at the cost of throughput (measured in images per second), and that reducing the bit-width of the data types would decrease area and power at the cost of accuracy (measured as the percentage of correct predictions).
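The bit-width sweep can be sketched as follows. This is an assumed stdlib-only model of fixed-point rounding and saturation, not the HLS datatypes themselves (which the real flow would use); the choice of fractional bits here is illustrative.

```cpp
#include <cmath>

// Hypothetical sketch: quantize a value to a signed fixed-point format with
// `width` total bits and `frac_bits` fractional bits, rounding to nearest
// and saturating on overflow. Sweeping width from 16 down to 12 mimics the
// exploration described above.
double quantize(double v, int width, int frac_bits) {
    const double scale   = std::pow(2.0, frac_bits);
    const long max_code  = (1L << (width - 1)) - 1;  // e.g. 32767 for 16 bits
    const long min_code  = -(1L << (width - 1));
    long code = std::lround(v * scale);              // round to nearest code
    if (code > max_code) code = max_code;            // saturate on overflow
    if (code < min_code) code = min_code;
    return code / scale;
}
```

Each bit removed roughly doubles the worst-case quantization error, which is why accuracy degrades gently from 16 to 13 bits and then falls off more sharply at 12.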
The design flow for this project looked like:
We used the following tool flow to assess power, performance, and area (PPA) and accuracy.
The detailed results from our experiments were:
In the table above, the green columns were aspects of the design that we varied and the blue columns contain data that we measured. As you can see, our predicted results were pretty much in line with our expectations. An error rate of 3.3% implies that the algorithm is correct 96.7% of the time. One of the interesting tidbits from this data is that there is very little increase in error rate when reducing the data bit-width from 16 to 14, and even 13 bits is pretty close. In the latter case, moving from 16 bits to 13 bits yields a 27% reduction in power and a 25% reduction in area for only 0.7% reduction in accuracy. However, moving down to 12 bits yields a distinct loss of accuracy.
These exploration experiments produced a very broad range of possible values for our implementation as shown in the table below:
As you can see, we have data points demonstrating a wide variety of measurements that impact the design. We could choose different implementation points depending on the requirements imposed by our application. If we were looking to implement a chip that would live on the edge of the network, extreme low-power might be a requirement. If our goal was to process images coming into a cloud server or an automobile, we might ignore power and instead go for the highest frame-rate possible.
This small example demonstrates why so much AI/machine learning hardware is being built with high-level synthesis. Stratus HLS allows the designer to concretely evaluate the PPA and accuracy trade-offs of multiple architectures and implementations from a single high-level model, selecting the best trade-off for the specific end application. For more information about implementing AI and machine learning hardware with Stratus HLS and the full Cadence flow, click here. For more information about the Stratus HLS solution, click here.