HLS for AI/ML Models: TensorFlow to RTL

19 Oct 2022 • 3 minute read

Artificial intelligence (AI) plays a key role in helping the semiconductor industry meet challenging demand and rising customer expectations. But implementing AI models in hardware (FPGA) is challenging. AI developers generally work with TensorFlow or Caffe models, which are not synthesizable. This creates a wide gap between developers in the AI team and digital hardware experts, so hardware implementation of AI-based models takes a long time. We need an automated solution that can handle the TensorFlow models used for algorithm design and help minimize time to market.

Learn how Stratus HLS helps bridge this gap by converting AI models (TensorFlow) into SystemC/C++ or RTL. It also helps measure power and area accurately instead of relying on guesswork.

In this blog, I will describe the flow that moves a model from the TensorFlow environment into abstract C++ with Stratus HLS.

Why Implement AI Algorithms in Hardware?

AI/ML and related applications have had an impact on every field they have touched, and the semiconductor industry is no different. The primary reasons to implement AI in hardware are:

To get the best performance from custom hardware for abstract AI models (TensorFlow/Caffe)
To achieve the desired power, performance, and area (PPA), especially for ultra-low-power edge devices such as battery-powered sensors, which run an AI network and need the best custom hardware solution

What Are the Challenges for Designers?

Implementing an AI application in hardware presents some critical challenges for the designer. These applications can be sensitive to power, area, and performance. For instance, ultra-low-power edge devices such as battery-powered sensors that run an AI network need a custom hardware solution to achieve the desired PPA. Other challenges include:

Exploring multiple ML algorithms to get the best custom hardware design
Quantifying PPA and neural network (NN) accuracy tradeoffs
Rapid design of production silicon to stay ahead of the competition

How Are TensorFlow Models Converted to RTL?

Stratus HLS automatically creates high-quality register-transfer level (RTL) design implementations for ASIC, system-on-chip (SoC), and FPGA targets from high-level IEEE 1666 SystemC and C++ descriptions.

High Level Synthesis - AI to Hardware

TensorFlow is based on Python, and HLS cannot synthesize Python into RTL. Stratus HLS gets from the AI model to SystemC/C++ and then RTL using the steps shown below. Each SystemC module corresponds to one TensorFlow function. The flow starts with creating the NN in TensorFlow and extracting the model metadata (weights and bias values). The same model is then implemented in SystemC and parameterized with the weights and biases obtained from training.

AI to RTL Implementation
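As a rough illustration of the "one module per TensorFlow function" idea, here is a minimal sketch in plain C++ rather than SystemC. The names DenseParams, dense, and relu are hypothetical and are not part of Stratus HLS or TensorFlow, and the parameter values a caller supplies would stand in for weights and biases exported after training. The sketch uses std::vector for brevity; a synthesizable version would typically use fixed-size arrays inside SystemC modules.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for parameters exported from a trained model.
struct DenseParams {
    std::vector<std::vector<float>> weights;  // weights[o][i]: input i to output o
    std::vector<float> biases;                // one bias per output
};

// Counterpart of a dense layer (matrix multiply plus bias add).
std::vector<float> dense(const DenseParams& p, const std::vector<float>& x) {
    assert(p.weights.size() == p.biases.size());
    std::vector<float> y(p.biases);  // start each output at its bias
    for (std::size_t o = 0; o < p.weights.size(); ++o) {
        assert(p.weights[o].size() == x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[o] += p.weights[o][i] * x[i];
        }
    }
    return y;
}

// Counterpart of a ReLU activation: clamps negative values to zero.
std::vector<float> relu(std::vector<float> v) {
    for (float& e : v) {
        if (e < 0.0f) e = 0.0f;
    }
    return v;
}
```

Chaining the two functions, as in relu(dense(params, input)), mirrors how the layers are composed in the TensorFlow model, with each function a candidate for its own hardware module.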

The parameters can be fine-tuned to match the performance requirements if needed. This kind of tradeoff analysis is difficult to do in a traditional RTL flow. In the Stratus environment, we can define multiple HLS configurations that produce multiple RTL implementations from the same source code.

Multiple Architectures HLS

This helps perform the tradeoff analysis and improves the chances of finding the ideal architecture, which is much harder with a traditional RTL flow. SystemC links TensorFlow to Stratus HLS because TensorFlow operations are analogous to hardware modules written in IEEE 1666 SystemC. The TensorFlow code and its equivalent SystemC code are shown in the figure below:

TensorFlow to SystemC and RTL
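To make the latency versus area tradeoff between architectures more concrete, here is a deliberately idealized sketch. It is not Stratus output and does not model a real HLS scheduler: it assumes a dot product of a given length, a hypothetical number of parallel multiply-accumulate (MAC) units, and one MAC operation per unit per cycle. The names ArchEstimate and estimate_dot_product are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Idealized estimate for one architecture choice (assumption: one MAC per unit per cycle).
struct ArchEstimate {
    std::size_t mac_units;  // parallel MAC units, a rough proxy for area
    std::size_t cycles;     // cycles to finish the dot product, a rough proxy for latency
};

// Estimate cycles as ceil(length / parallel_macs) for a dot product of `length` terms.
ArchEstimate estimate_dot_product(std::size_t length, std::size_t parallel_macs) {
    assert(parallel_macs > 0);
    const std::size_t cycles = (length + parallel_macs - 1) / parallel_macs;
    return ArchEstimate{parallel_macs, cycles};
}
```

Under this model, a 784-element input (a 28x28 image) takes 784 cycles with one MAC unit and 49 cycles with 16, which is the kind of fast versus small choice that separate HLS configurations let you explore from one source.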

Use-Case

The use case is the commonly used MNIST digits example, which recognizes images of handwritten digits, as in a cheque deposit application.

Neural Network for Pattern Recognition

The input to the NN is a small image, and the output is a vector of 10 numbers, each giving the probability that the input image is the corresponding digit. The highest value gives the recognized digit.
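The final step, picking the highest of the 10 output values, can be sketched in a few lines of plain C++. The function name recognized_digit is hypothetical, and the sketch assumes the network output has already been computed and holds exactly 10 scores, one per digit from 0 to 9.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the index (digit 0 to 9) of the largest score.
// On a tie, the lowest digit wins because only strictly larger scores replace the best.
std::size_t recognized_digit(const std::vector<float>& scores) {
    assert(scores.size() == 10);
    std::size_t best = 0;
    for (std::size_t d = 1; d < scores.size(); ++d) {
        if (scores[d] > scores[best]) best = d;
    }
    return best;
}
```

Because only the position of the maximum matters, this step tolerates small numerical errors in the scores, which is one reason reduced-precision arithmetic can cost little accuracy.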

Results

AI/ML applications are first defined with a framework (TensorFlow). Cadence provides a straightforward flow to move from TensorFlow to custom hardware for edge applications. Stratus HLS enables power, performance, area, and accuracy tradeoffs. To showcase the benefits, we created a network in TensorFlow and implemented it in SystemC using fixed-point types at multiple bit widths. For each bit width, we created fast, medium, and slow implementations in terms of latency. The detailed results from our experiments were:

AI/ML models and PPA

Moving from 16 bits to 13 bits yields a 14% reduction in power and a 19% reduction in area for only a 0.64% reduction in accuracy, a tradeoff the designer can choose to accept depending on the application.
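To show why narrower fixed-point types trade accuracy for hardware cost, here is a minimal plain C++ sketch of signed fixed-point rounding with saturation. It is not the code used in the experiments: the function name quantize_fixed is hypothetical, and the split between integer and fractional bits is an assumption, since the post only gives total bit widths. In a SystemC design, a fixed-point type such as sc_fixed would play this role.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round `value` to a signed fixed-point grid with `total_bits` bits,
// `frac_bits` of which are fractional, saturating at the representable range.
// Returns the quantized value as a double so the error can be inspected.
double quantize_fixed(double value, int total_bits, int frac_bits) {
    assert(total_bits >= 2 && total_bits <= 32);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    const std::int64_t max_raw = (std::int64_t{1} << (total_bits - 1)) - 1;
    const std::int64_t min_raw = -(std::int64_t{1} << (total_bits - 1));
    std::int64_t raw = std::llround(value * scale);   // round to nearest step
    if (raw > max_raw) raw = max_raw;                 // saturate high
    if (raw < min_raw) raw = min_raw;                 // saturate low
    return static_cast<double>(raw) / scale;
}
```

With an assumed 4 integer bits, quantizing a weight of 0.3 to 16 bits (12 fractional) gives a smaller rounding error than quantizing it to 13 bits (9 fractional). Accumulated over a whole network, errors like this are what show up as the small accuracy drop, while the narrower datapath is what saves power and area.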