Earlier this week, Dave Apte presented a webinar on AI Accelerator Design with Stratus High-Level Synthesis. It was timed for the East Coast and for Europe, so I had to get up at 6:00am to attend, but since I'm still somewhat jet-lagged from getting back from Israel at the weekend, I was awake by 4:30am so I didn't even need my alarm to wake me.
The example design that Dave presented recognizes hand-written digits. It is trained on the standard MNIST data set, which provides 60,000 28x28-pixel training images and a further 10,000 images for testing. A neural network is constructed and trained using TensorFlow, and then the trained model metadata (the weights) is extracted. The network has six computational layers: two 2-D multi-channel convolution layers, two pooling layers, and two dense layers (see the diagram).
The TensorFlow operations in the network consist of 2-D convolution (applied to 3x3 pixel subsets of the image), max pooling, which selects the maximum value from each 2x2 working set to reduce the size of the image by a factor of two, and matrix multiplication for the final two dense layers.
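To make the three operations concrete, here is a minimal single-channel sketch of each in plain C++. These are illustrative functions written for this article, not code from the webinar or the actual design; the names (`conv3x3`, `maxpool2x2`, `dense`) and shapes are assumptions.

```cpp
#include <array>
#include <vector>
#include <algorithm>

// 2-D convolution of one channel with a 3x3 kernel ("valid" padding):
// each output pixel is the weighted sum of a 3x3 neighborhood.
std::vector<std::vector<float>>
conv3x3(const std::vector<std::vector<float>>& img,
        const std::array<std::array<float, 3>, 3>& k) {
    std::size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<float>> out(h - 2, std::vector<float>(w - 2, 0.f));
    for (std::size_t r = 0; r + 2 < h; ++r)
        for (std::size_t c = 0; c + 2 < w; ++c)
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    out[r][c] += img[r + i][c + j] * k[i][j];
    return out;
}

// 2x2 max pooling: keep the maximum of each non-overlapping 2x2 window,
// halving the image in each dimension.
std::vector<std::vector<float>>
maxpool2x2(const std::vector<std::vector<float>>& img) {
    std::size_t h = img.size() / 2, w = img[0].size() / 2;
    std::vector<std::vector<float>> out(h, std::vector<float>(w));
    for (std::size_t r = 0; r < h; ++r)
        for (std::size_t c = 0; c < w; ++c)
            out[r][c] = std::max({img[2*r][2*c],     img[2*r][2*c+1],
                                  img[2*r+1][2*c],   img[2*r+1][2*c+1]});
    return out;
}

// Dense layer: a matrix-vector multiply plus a bias vector.
std::vector<float>
dense(const std::vector<std::vector<float>>& weights,
      const std::vector<float>& in, const std::vector<float>& bias) {
    std::vector<float> out(bias);
    for (std::size_t r = 0; r < weights.size(); ++r)
        for (std::size_t c = 0; c < in.size(); ++c)
            out[r] += weights[r][c] * in[c];
    return out;
}
```

In the real network these operate on multi-channel tensors, but the per-pixel arithmetic is the same.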
At a conceptual level, the TensorFlow operations are analogs of SystemC hardware modules. Graph connections in the network are analogs of SystemC I/O interfaces. And a whole TensorFlow session is like a SystemC simulation. That's the basic way that the network is made synthesizable: each TensorFlow function has a corresponding SystemC module, which defines the I/O interface and threads, along with the algorithm. The modules are C++ templates for easy configuration.
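The module pattern can be sketched in plain C++ as follows. This is a hypothetical illustration, not the webinar's code: in the real flow the class would derive from `sc_module`, the queues would be `sc_fifo` stream ports, and `run()` would be registered as an `SC_THREAD`. A simple thresholding (ReLU-style) operation stands in for a real layer's algorithm, and the template parameter shows how one source can be configured for different data types (for example, float versus fixed point).

```cpp
#include <deque>

// Hypothetical stand-in for a SystemC layer module: a templated class
// with an input stream, an output stream, and a processing thread.
template <typename T>
class LayerModule {
public:
    std::deque<T> in;   // stands in for an input stream port (sc_fifo)
    std::deque<T> out;  // stands in for an output stream port

    // The "thread": consume one value per activation, apply the layer's
    // algorithm (here, clamp negatives to zero), and emit the result.
    void run() {
        while (!in.empty()) {
            T v = in.front();
            in.pop_front();
            out.push_back(v > T(0) ? v : T(0));
        }
    }
};
```

Instantiating `LayerModule<float>` and `LayerModule<int>` from the same source is the template-based configurability the article describes.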
The conversion is not completely automatic. You, as the hardware designer, have to figure out the dataflow architecture that meets your goals, such as whether to use a streaming interface, a memory interface, or a bus interface. TensorFlow works with entire tensors, but in hardware the data has to be handled in smaller chunks.
For the example network, the network dataflow is implemented by streaming and line buffer interfaces. Each module receives and produces data one pixel at a time in raster scanning order. Some modules require specialized buffering to process the data correctly (e.g., the 3x3 convolution).
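The specialized buffering for the 3x3 convolution is typically a line buffer. Here is a minimal sketch, written for this article under the assumption of a classic two-row line buffer (the class name and interface are illustrative): pixels arrive one at a time in raster order, the last two complete rows are retained, and once two rows plus two pixels have streamed in, every new pixel completes a full 3x3 window.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical line buffer feeding a 3x3 convolution from a raster-order
// pixel stream. Three row buffers rotate as each image row completes.
class LineBuffer3x3 {
public:
    explicit LineBuffer3x3(std::size_t width)
        : width_(width), rows_(3, std::vector<float>(width, 0.f)) {}

    // Push one pixel; returns true when window() holds a valid 3x3 window.
    bool push(float px) {
        rows_[2][col_] = px;
        bool valid = row_ >= 2 && col_ >= 2;
        if (valid)
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    win_[i][j] = rows_[i][col_ - 2 + j];
        if (++col_ == width_) {  // end of row: rotate the row buffers
            col_ = 0;
            ++row_;
            std::rotate(rows_.begin(), rows_.begin() + 1, rows_.end());
        }
        return valid;
    }

    const std::array<std::array<float, 3>, 3>& window() const { return win_; }

private:
    std::size_t width_, col_ = 0, row_ = 0;
    std::vector<std::vector<float>> rows_;
    std::array<std::array<float, 3>, 3> win_{};
};
```

For a WxH image this produces one valid window per output pixel, (W-2)x(H-2) in total, while storing only three rows rather than the whole image, which is the point of streaming hardware.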
The project ends up with six SystemC modules to be synthesized to RTL, one for each layer of the TensorFlow network. Multiple configurations of each module can be created for design space exploration, meaning multiple RTL implementations from the same SystemC code. It also contains a SystemC testbench to feed the weights, biases, and test data to the network, and compare the results to golden results from TensorFlow to measure accuracy. TensorFlow runs with full floating-point accuracy, but our implementation is in fixed point.
The first step is algorithmic verification, to examine the performance (percentage accuracy) of the model with the chosen bit widths. You can see from the table that the accuracy drops from 97.1% with TensorFlow (floating point) to 96.68% with 16-bit fixed-point hardware. It doesn't change much down to 13-bit fixed point, and then drops dramatically at 12 bits. Note that it is up to you as the designer to decide what accuracy level is acceptable: the higher the accuracy, the greater the area and the power consumption.
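The mechanism behind that accuracy curve is quantization error. As a rough sketch (my illustration, not the design's actual number format), quantizing a value to a signed fixed-point format with a given total and fractional bit count, then converting back, shows how the rounding error grows as bits are removed:

```cpp
#include <cmath>

// Quantize x to signed fixed point with `total` bits, `frac` of them
// fractional, saturating at the representable range, then convert back.
// Illustrative only: the real design's formats are not specified here.
double quantize(double x, int total, int frac) {
    double scale = std::ldexp(1.0, frac);        // 2^frac
    long long max_q = (1LL << (total - 1)) - 1;  // saturation limits
    long long min_q = -(1LL << (total - 1));
    long long q = std::llround(x * scale);
    if (q > max_q) q = max_q;
    if (q < min_q) q = min_q;
    return q / scale;
}
```

With 16 bits the round-trip error on a typical weight is tiny; with far fewer bits it becomes a significant fraction of the value, and once the error is comparable to the differences the classifier relies on, accuracy collapses, which matches the sharp drop seen at 12 bits.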
Next, Stratus HLS synthesizes each module to RTL. At this point, you need to decide what technology and library are being used, and what clock period you want. Unlike with traditional RTL design, where you can typically only afford to build a single implementation, you can create several alternative RTL implementations targeting different PPA (power, performance, area) points. Without changing any code, you can vary the bit widths, and you can vary the synthesis configuration for each module, trading off latency against hardware area (small and slow versus large and fast). For example, in the table you can see that we found a solution that is 19% smaller and uses 14% less power for only a 0.64% reduction in accuracy. In the table, fast is the fastest possible hardware, slow is the smallest possible hardware, and medium is a compromise. You can make more radical changes, too, such as synthesizing to a different library or with a different system clock frequency.
Once Stratus HLS has produced the RTL, it needs to be verified: performance (latency, throughput) measured, PPA assessed, and connectivity and interoperability with the rest of the chip checked. Note that the same SystemC testbench is used both to verify that the algorithm is correct and sufficiently accurate, and then to verify that the RTL is correct and meets the desired PPA.
In summary, the process is:

- Build and train the neural network in TensorFlow, then extract the trained weights.
- Implement each TensorFlow layer as a configurable SystemC module with an appropriate dataflow architecture.
- Run algorithmic verification in fixed point against TensorFlow's golden results to choose acceptable bit widths.
- Synthesize each module to RTL with Stratus HLS, exploring multiple configurations for PPA.
- Verify the RTL with the same SystemC testbench and confirm latency, throughput, and PPA.
If you have access to Cadence support, there is a Rapid Adoption Kit that holds all the files associated with this demo neural network. You can also visit the Stratus HLS learning center.
There is a Stratus product page (accessible to anyone).
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.