HLS Optimizations You Can't Do By Hand

10 May 2019 • 3 minute read

In my previous blog post, I talked about the Quality of Results (QoR) achievable with High-Level Synthesis tools like Stratus HLS, and the fact that exploring multiple RTL architectures is often the feature that enables HLS users to beat hand-coded RTL flows on QoR. That article raised the notion that "project schedule" is a critical factor when judging comparative QoR, and one that often gets left out of the equation. Rather than asking "Can HLS get better QoR than hand-coded RTL?", it's important to ask "Can HLS get better QoR than hand-coded RTL in two weeks?"

In this post, I will focus on some things that HLS can do that you cannot realistically do by hand. The example I will use is similar to one that was presented to us by a production user of HLS, so it reflects what is actually happening in the industry. I will show an example where HLS can find significant sharing opportunities that are difficult to find and implement by hand, and where using HLS allows you to get the area benefit of that sharing without turning your source code into an unmaintainable mess.

In general, the kinds of optimizations that can be performed by HLS tools that are beyond most hand-coders fall into the camp of dealing with massive complexity. Let's look at a simple example and then I'll extrapolate that to more complex cases.

In our simple example, we have one algorithm that implements a DCT and one that implements an IDCT. The two functions are mutually exclusive: if the DCT is active, the IDCT is idle, and vice versa. Both designs are pipelined with the same initiation interval (II, the pipeline's throughput rate), which is 1 in our example.

The top-level code for this looks like:
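A minimal sketch of such a top level, not the original listing. It assumes an 8-point transform on floating-point samples, and the names `Mode`, `dct8`, `idct8`, and `transform_top` are hypothetical. A real HLS design would typically use fixed-point types and the tool's own pipelining directives, both of which are omitted here.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumed block size for illustration; the article does not state one.
constexpr std::size_t N = 8;
using Block = std::array<double, N>;

// Hypothetical mode switch: exactly one transform is active per call.
enum class Mode { Forward, Inverse };

// Orthonormal scaling factor for coefficient k.
inline double scale(std::size_t k) {
    return std::sqrt((k == 0 ? 1.0 : 2.0) / N);
}

// Cosine basis value shared by both transforms.
inline double basis(std::size_t n, std::size_t k) {
    const double pi = std::acos(-1.0);
    return std::cos(pi * (2.0 * n + 1.0) * k / (2.0 * N));
}

// Forward transform (DCT-II), written as a plain multiply-accumulate loop nest.
Block dct8(const Block& x) {
    Block X{};
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            X[k] += x[n] * basis(n, k);
        }
        X[k] *= scale(k);
    }
    return X;
}

// Inverse transform (DCT-III), same multiply-accumulate structure.
Block idct8(const Block& X) {
    Block x{};
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t k = 0; k < N; ++k) {
            x[n] += scale(k) * X[k] * basis(n, k);
        }
    }
    return x;
}

// Top level: the two branches are mutually exclusive, so the multipliers
// and adders of one branch are idle whenever the other branch runs.
Block transform_top(Mode mode, const Block& in) {
    return mode == Mode::Forward ? dct8(in) : idct8(in);
}
```

The point of writing it this way is that each algorithm stays a separate, readable function. The mutual exclusion is visible only in `transform_top`, which is the information a tool needs in order to consider sharing operators between the two branches.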

If we look at each of these algorithms in isolation, a pipelined version of each algorithm will use (given the constraints we provided) 192 multipliers and 400 adders. A logical organization of this would be to:

Implement a DCT algorithm
Implement an IDCT algorithm
Build some switching logic and combine the algorithms in a single module with a mode switch

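The problem with this approach is that it shares resources inefficiently. Implemented separately, each algorithm carries its own full set of operators, so the combined module needs roughly 2 × 192 = 384 multipliers and 2 × 400 = 800 adders, before counting the switching logic.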

With an HLS flow, however, the tool can automatically share resources across the entire implementation, wherever its cost functions determine that sharing is worthwhile.

The combined algorithm still has only 192 multipliers, and the adder count is only 520 (some smaller adders were not shared because the tool calculated that sharing them would not be beneficial).
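To put those counts in perspective, here is a small worked calculation. It is a sketch that assumes the separate implementation costs simply the sum of the two isolated figures (2 × 192 multipliers, 2 × 400 adders) and ignores the switching logic. The `Resources` struct and both function names are hypothetical.

```cpp
#include <cassert>

// Operator counts for one implementation.
struct Resources {
    int multipliers;
    int adders;
};

// Two independent blocks placed side by side: their costs simply add up.
constexpr Resources side_by_side(Resources a, Resources b) {
    return {a.multipliers + b.multipliers, a.adders + b.adders};
}

// Percentage of units saved when going from `before` units to `after` units.
inline double percent_saved(int before, int after) {
    assert(before > 0 && after >= 0 && after <= before);
    return 100.0 * (before - after) / before;
}
```

With the figures from this post, `side_by_side({192, 400}, {192, 400})` gives 384 multipliers and 800 adders. Against the shared result of 192 multipliers and 520 adders, `percent_saved(384, 192)` is 50 and `percent_saved(800, 520)` is 35, so under this assumption sharing removes half the multipliers and about a third of the adders.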

Now, with 2 algorithms as similar as the DCT and IDCT, it might be possible for a hand-coder to do this level of sharing. However, as the requirements become even more complex, that task becomes almost impossible. Imagine the case where you have completely disparate algorithms AND they are pipelined at different throughput rates.
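A back-of-the-envelope sketch of why different throughput rates complicate the picture. It uses the standard bound that a pipeline issuing `ops` operations every `ii` cycles needs at least ceil(ops / ii) units of that operator. The function names are hypothetical, and this is only a lower bound on what sharing could achieve, not a description of how any particular tool schedules.

```cpp
#include <algorithm>
#include <cassert>

// Minimum number of units needed to issue `ops` operations every `ii` cycles:
// ceil(ops / ii), computed with integer arithmetic.
inline int min_units(int ops, int ii) {
    assert(ops >= 0 && ii >= 1);
    return (ops + ii - 1) / ii;
}

// No sharing between the two algorithms: each gets its own pool of units.
inline int unshared_units(int ops_a, int ii_a, int ops_b, int ii_b) {
    return min_units(ops_a, ii_a) + min_units(ops_b, ii_b);
}

// Mutually exclusive algorithms sharing one pool: the pool can be no smaller
// than what the more demanding algorithm needs on its own.
inline int shared_lower_bound(int ops_a, int ii_a, int ops_b, int ii_b) {
    return std::max(min_units(ops_a, ii_a), min_units(ops_b, ii_b));
}
```

As an illustration with made-up numbers for the second algorithm: one algorithm with 192 multiplies at II = 1 needs 192 multipliers, and another with 300 multiplies at II = 4 needs 75. Unshared, that is 267 multipliers; shared, the bound is 192. Reaching that bound by hand means rescheduling 300 operations onto units whose timing was designed around a different pipeline, which is where manual sharing stops being practical.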

The second dimension worth noting is "how do you maintain the code?" Even if you can take on the very complex hand-coded implementation of such sharing, can you maintain that RTL code? Can one of your colleagues maintain it? Can you change one of the algorithms without completely breaking how its operations are shared with the 2nd algorithm (or the 3rd or 4th)?

With an HLS flow, you don't have to worry about how you modify the sharing in the RTL code. You simply change your top-level C++ code and rerun the HLS tool. It will figure out the sharing that is possible for you each time you make a modification.

This notion of "what optimizations can I get AND still have modular, maintainable source code" is often one of the major determining factors in how a design gets implemented. In a hand-coded RTL flow, the maintainability issue often prevents sharing of whole algorithms. With HLS, the tool doesn't care about that complexity. It simply re-derives the best sharing it can find each time you run it.

To learn more about an HLS flow that supports these kinds of optimizations, click here.