Whenever we talk to potential customers about Stratus HLS, we usually mention that many users get better quality of results (QoR) with a Stratus HLS flow than they did simply writing RTL by hand. Often, we are greeted with looks of skepticism or downright disbelief. We are usually asked the honest question:
Can Stratus HLS really achieve better QoR than we can get using traditional RTL coding?
I assure you that it can. The remainder of this article is devoted to explaining why and how.
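In general, the reasons that an HLS design flow can often achieve better results than a hand-coded RTL flow fall into two categories:
(1) Optimizations that are extremely difficult or practically impossible for a designer hand-coding RTL.
(2) Optimizations that a designer could perform by hand but rarely has the time to do.
In this article, we will concentrate on item (2) and save (1) for a future date.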
The first major QoR benefit you can get from writing highly abstract code and implementing RTL with Stratus HLS is that you can explore a much broader range of possible RTL implementations with the HLS flow. Using a combination of directives, constraints, and coding alternatives, you can automatically produce a wide range of micro-architectures (a big word for an RTL implementation) that all have essentially the same functionality. For example, from the same C++ source code you might produce one implementation that uses single-port memories for storage and another that uses dual-port memories. Yet another pair of implementations might differ in whether they use registers or memories.
Each of these micro-architectures could certainly be created by hand (using vi or emacs), but it is unusual that a designer is given the time to explore such drastic changes to their RTL. It is not a situation where HLS tools are doing something that a designer cannot. The issue is that the designer using a hand-coding flow simply does not have the time to explore the range of alternatives that you can achieve using HLS.
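To make the idea of "one source, many micro-architectures" a little more concrete, here is a minimal sketch in plain C++. It is not Stratus HLS code: the LANES template parameter is a hypothetical stand-in for the kind of choice that would normally be made with tool directives or constraints, and it explores parallelism rather than the memory choices mentioned above. The point is only that every variant computes the same result while implying a different amount of hardware.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical knob: LANES models how many multiply-accumulate units
// work in parallel. In a real HLS flow this choice would typically come
// from directives or constraints, not from a template parameter.
template <std::size_t N, std::size_t LANES>
std::int32_t dot_product(const std::array<std::int16_t, N>& a,
                         const std::array<std::int16_t, N>& b) {
    static_assert(LANES > 0 && N % LANES == 0, "LANES must divide N");

    // One partial accumulator per lane.
    std::array<std::int32_t, LANES> partial{};

    for (std::size_t i = 0; i < N; i += LANES) {
        // The inner loop is what a tool could implement as parallel hardware.
        for (std::size_t lane = 0; lane < LANES; ++lane) {
            partial[lane] += static_cast<std::int32_t>(a[i + lane]) *
                             static_cast<std::int32_t>(b[i + lane]);
        }
    }

    // Combine the per-lane partial sums.
    std::int32_t sum = 0;
    for (std::int32_t p : partial) {
        sum += p;
    }
    return sum;
}
```

Calling dot_product<8, 1>, dot_product<8, 2>, or dot_product<8, 8> on the same inputs returns the same value. What changes is the implied structure: one multiplier used eight times versus eight multipliers used once. Writing each of those variants by hand in RTL is exactly the kind of exploration that rarely fits in the schedule.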
Another very common difference is related to the issue of "time." When a designer is writing RTL code by hand, they often choose an architecture, do the implementation, and perform detailed optimizations like sharing and arranging the critical resources into states. They carefully design the FSM such that the operations (multipliers, muxes, memory accesses etc.) are all arranged to meet timing through logic synthesis.
Then, a late functional change comes in. Very often, this new functionality is hacked into the design in an extremely isolated manner so as to not disturb the code that has already been built and partially verified. There is rarely time for the designer to fully integrate this new functionality into the existing code and redesign things like the sharing of operations. Rather than the nice clean, well-optimized design you started out with, you end up with the new functionality just grafted onto the side of the original design.
In an HLS flow, the C++ code is written in a more abstract manner. The HLS tool will build the FSM for you and will perform detailed analysis of timing, area, and sharing opportunities every time. When late changes come in, you simply implement the new functionality in an integrated manner and re-run the HLS tool. All of the re-analysis and re-optimization that is sacrificed due to schedule pressure in a hand-coded flow is automatically performed by the HLS tool. EVERY TIME. This very often leads to more optimal implementations in the HLS flow. Again, it's not that a hand-coding designer can't do this - they simply don't have time.
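As a rough illustration of what "re-run the analysis every time" means, the sketch below implements a toy greedy list scheduler in C++. It is our own simplification, not the algorithm Stratus HLS uses: the Op struct, the schedule_states function, and the assumption that every operation takes exactly one state are all hypothetical. It assigns operations to FSM states while sharing a fixed number of multipliers.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical description of one operation in the design.
struct Op {
    bool uses_multiplier;           // true if the op needs a shared multiplier
    std::vector<std::size_t> deps;  // indices of earlier ops it depends on
};

// Toy greedy list scheduler. Assumes ops are listed in dependency order
// and that every op completes in one state. Returns the number of FSM
// states needed when at most `multipliers` multiplications may run in
// any single state.
std::size_t schedule_states(const std::vector<Op>& ops, std::size_t multipliers) {
    if (multipliers == 0) {
        throw std::invalid_argument("need at least one multiplier");
    }
    std::vector<std::size_t> state_of(ops.size(), 0);
    std::vector<std::size_t> muls_in_state;  // multipliers busy per state
    std::size_t states = 0;

    for (std::size_t i = 0; i < ops.size(); ++i) {
        // Earliest legal state: one after the latest dependency.
        std::size_t s = 0;
        for (std::size_t d : ops[i].deps) {
            s = std::max(s, state_of[d] + 1);
        }
        if (ops[i].uses_multiplier) {
            // Slide forward until a state has a free multiplier.
            while (true) {
                if (s >= muls_in_state.size()) {
                    muls_in_state.resize(s + 1, 0);
                }
                if (muls_in_state[s] < multipliers) {
                    break;
                }
                ++s;
            }
            ++muls_in_state[s];
        }
        state_of[i] = s;
        states = std::max(states, s + 1);
    }
    return states;
}
```

When a late change adds operations, you append them to the list and call schedule_states again. The new multiplications are placed wherever a shared multiplier is free, rather than being grafted on with their own dedicated hardware. A real tool weighs timing and area as well, but the principle of redoing the whole analysis on every run is the same.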
Another case that we have seen often in the real world relates to timing analysis of data paths. When writing code by hand, designers will often analyze the timing of the various operations and calculate how many pipeline stages are required. The degree of accuracy of this timing analysis varies and very often the timing of operations (adders, multipliers etc.) is generalized.
When using Stratus HLS, this kind of timing analysis is very systematic. Stratus bases its analysis of each resource on the exact operation that is required. Thus, if the multiplication is 13x7 bits, then the timing analysis is done based on the delay of a 13x7 bit multiplier. If you later decide you need a 14x8 bit multiplier, Stratus HLS will automatically adjust and recalculate path delays. As shown in the diagram above, each of these operations is processed using Genus logic synthesis within Stratus HLS to provide very accurate timing information. Accurate numbers are extracted from this flow (as opposed to rough estimates) and those numbers are always up to date and change along with your design.
The accuracy of this data is the foundation of Stratus’ ability to analyze every timing path in the design. In one recent case, the hand-coded RTL had 4 pipeline stages, but Stratus HLS calculated that it could be done in 3. The elimination of an entire pipeline stage reduced registers and saved area. Clearly, the RTL designer could have done this detailed timing analysis, but they simply did not have time. It's not a question of what could be done, it's a question of what is actually done in practice.
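The effect of accurate delay numbers on pipeline depth can be sketched with a few lines of C++. This is a simplified model under stated assumptions: the pipeline_stages function is hypothetical, the data path is a single chain of operations, register overhead is ignored, and the delay values in the usage note are made-up numbers, not measurements from any tool or library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Packs a linear chain of operations into pipeline stages, filling each
// stage until the next operation would exceed the clock period.
// Delays and clock period are in picoseconds. Register setup and
// clock-to-output overhead are ignored in this simplified model.
std::size_t pipeline_stages(const std::vector<int>& delays_ps, int clock_ps) {
    std::size_t stages = 0;
    int used_ps = 0;  // time already consumed in the current stage

    for (int d : delays_ps) {
        if (d > clock_ps) {
            throw std::invalid_argument("operation is slower than the clock period");
        }
        // Open a new stage for the first op, or when this op does not fit.
        if (stages == 0 || used_ps + d > clock_ps) {
            ++stages;
            used_ps = 0;
        }
        used_ps += d;
    }
    return stages;
}
```

Suppose a multiply-add-multiply-add chain must meet a 1000 ps clock. With generalized estimates of 800 ps per multiplier and 400 ps per adder, pipeline_stages({800, 400, 800, 400}, 1000) returns 4. With hypothetical width-specific delays of 600, 350, 700, and 350 ps, pipeline_stages({600, 350, 700, 350}, 1000) returns 3. The only thing that changed between the two runs is the accuracy of the input data.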
The combination of exploration and systematic (i.e., every time) analysis and optimization based on really accurate data allows Stratus HLS to produce designs with very competitive QoR.
The question posed at the beginning of this article, whether Stratus HLS can achieve better QoR than designers using a traditional RTL coding flow, is in fact the wrong question. A better question is:
"Can a design flow using Stratus HLS achieve better QoR than a designer using traditional RTL coding IN TWO WEEKS?"
Time is a critical factor in comparing HLS and hand-coded RTL flows. It's not that designers cannot do what Stratus HLS does; they simply don't have time to do it.
Next time, we'll investigate optimization techniques that Stratus HLS uses that are either extremely difficult or practically impossible for the hand-coded RTL designer.
To learn more about how schedule affects your PPA (Power, Performance, Area) check out this blog post!