Over the past decade, we have seen a dramatic increase in the size of common video formats. Addressing this has required an evolution in the performance and complexity of video CODECs, all the way from MPEG-2 to H.265. For designing hardware for these CODECs, high-level synthesis (HLS) has been a common choice, thanks to the large productivity gains it provides and the ability of designers to experiment easily with multiple architectural choices (“design space exploration”).
The HLS team here at Cadence® has watched customers build these designs all over the world, and we have the passport stamps to prove it. We have seen multiple approaches to implementing various CODECs, and I want to share some of what we have learned and, hopefully, provide some useful advice should you be looking to start such a design.
The diagram below shows a very high-level block diagram of an H.265 design.
Each of the blocks in the diagram is a processing step, and most of them represent a digital signal processing algorithm of significant complexity. These algorithms are well suited to being described in C++ and implemented using HLS. Teams using HLS for these designs typically maintain a partitioning similar to the one in the diagram, breaking up the overall algorithm into a set of separately synthesized modules. Since these projects generally require multiple engineers, this kind of decomposition is useful both conceptually and from a management point of view, because you need to partition the work effectively across the team.
So, let's discuss three key lessons we have learned over the past decade in this design space:
Build the system model first, and build it in a manner that maintains a high level of abstraction. Keeping the models at a high level of abstraction maximizes both the flexibility of the IP and the opportunity for architectural exploration. Using interfaces that are both flexible and fast, internally between blocks as well as with the outside world, is also key to the efficiency of design and verification on the project.
To support this activity, the Stratus HLS team has developed a library of interface IP that is broad and flexible. You can simulate your design using these interfaces in “transaction mode” (commonly called a TLM simulation) and then, at the push of a button, switch over to a simulation that runs in “cycle-accurate-interface mode,” where the internals of the blocks are simulated in abstract C++ mode, but the interfaces are cycle-accurate (CA). CA mode simulates more slowly than TLM mode, but it is still much faster than RTL simulation and provides a more complete verification of the interaction between the modules in the design. The polymorphic nature of these interfaces has very high value in real projects and should be used wherever possible.
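As a rough illustration of what "polymorphic" means here, the sketch below writes the block code once against an abstract channel and then runs it with either a transaction-style or a cycle-counting implementation. It is plain C++ rather than SystemC, and every name in it (Channel, TlmChannel, CaChannel, run_pipeline) is hypothetical; it is not the Stratus interface library, and the one-cycle cost per transfer is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical block-to-block interface (not the Stratus library API).
template <typename T>
class Channel {
public:
    virtual ~Channel() = default;
    virtual void put(const T& v) = 0;          // producer side
    virtual T get() = 0;                       // consumer side
    virtual std::uint64_t cycles() const = 0;  // interface cycles modeled so far
};

// Transaction mode: data moves in a single call and no timing is modeled.
template <typename T>
class TlmChannel : public Channel<T> {
    std::deque<T> fifo_;
public:
    void put(const T& v) override { fifo_.push_back(v); }
    T get() override {
        assert(!fifo_.empty());
        T v = fifo_.front();
        fifo_.pop_front();
        return v;
    }
    std::uint64_t cycles() const override { return 0; }
};

// Cycle-accurate-interface stand-in: same data movement, but each put and
// each get is charged one cycle (an assumed handshake cost).
template <typename T>
class CaChannel : public Channel<T> {
    std::deque<T> fifo_;
    std::uint64_t cycles_ = 0;
public:
    void put(const T& v) override { ++cycles_; fifo_.push_back(v); }
    T get() override {
        assert(!fifo_.empty());
        ++cycles_;
        T v = fifo_.front();
        fifo_.pop_front();
        return v;
    }
    std::uint64_t cycles() const override { return cycles_; }
};

// The "design" is written once against Channel<int> and never changes.
inline std::vector<int> run_pipeline(Channel<int>& ch, const std::vector<int>& in) {
    std::vector<int> out;
    for (int x : in) {
        ch.put(x * 2);                // producer block
        out.push_back(ch.get() + 1);  // consumer block
    }
    return out;
}
```

Calling run_pipeline with a TlmChannel<int> and then with a CaChannel<int> gives identical data results; only the second reports a nonzero cycle count. That is the property the mode switch depends on: the block code stays the same and only the interface model underneath it changes.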
This kind of organization lets you validate your algorithms very quickly using fast transaction-level simulation. In addition, it lets you verify the interface protocols in the design using behavioral SystemC, which is far faster than RTL simulation. This is superior to a simplistic model where your code is organized as a set of function calls. With that flow, the first time your blocks ever interact through their hardware interfaces in simulation will be in RTL, which makes it much harder to arrive at a fully functional RTL model of your design. In addition, if you want to explore the design space and that exploration requires changing interfaces, you have to repeat the manual RTL hookup for each interface change.
When it comes to actually performing the HLS step, it's important to be able to run HLS both at the module level (i.e., each green block in isolation) and at the whole-design level (the entire contents of the blue rectangle). Being able to synthesize all the modules together with their connectivity lets you generate an RTL implementation of the design that includes everything you need. There is no need to go through the very error-prone process of manually hooking up blocks once you have generated them.
Having the interfaces explicitly modeled in your C++ model is a superior flow. It allows you to synthesize exactly what you simulate (and verify). Of course, that requires a flavor of C++ that supports this kind of architecture. The industry-standard flavor that provides everything required is IEEE 1666 SystemC.
Decomposing the design into modules with clearly defined boundaries and interfaces also makes it easier to manage your engineering resources. Looking at the block diagram above, imagine that you assigned one designer to each module. Each designer can develop a module in isolation, and the modules can then be assembled into the high-level system model.
Once the system model is fully functional, the designers can begin running synthesis and optimizing their designs. As they generate RTL, the ideal debug environment is an automated one that simply plugs the RTL version of a module back into the abstract SystemC design. Using this technique, each designer can run a full system test in which the bulk of the modules are abstract SystemC (and thus simulate very quickly), combined with his or her specific RTL module.
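The idea of swapping one implementation into an otherwise abstract system, while keeping the stimulus and checks unchanged, can be sketched in plain C++. This is only an analogy: in a real flow the swapped-in block would be co-simulated Verilog RTL, whereas here a second C++ function stands in for it. The three stages (offset, scale, clip) and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Sample = std::uint32_t;
using Frame  = std::vector<Sample>;
using Block  = std::function<Frame(const Frame&)>;

// Abstract models of three hypothetical pipeline stages.
Frame offset_abstract(const Frame& in) {
    Frame out;
    for (Sample x : in) out.push_back(x + 16);
    return out;
}
Frame scale_abstract(const Frame& in) {
    Frame out;
    for (Sample x : in) out.push_back(x * 3);
    return out;
}
Frame clip_abstract(const Frame& in) {
    Frame out;
    for (Sample x : in) out.push_back(std::min<Sample>(x, 255));
    return out;
}

// Stand-in for the generated RTL of the scale stage: the same function,
// computed with shift-and-add the way hardware might implement it.
Frame scale_rtl_stub(const Frame& in) {
    Frame out;
    for (Sample x : in) out.push_back((x << 1) + x);
    return out;
}

// The "testbench": push the stimulus through every block in order.
Frame run_system(const std::vector<Block>& blocks, const Frame& stimulus) {
    Frame data = stimulus;
    for (const Block& b : blocks) data = b(data);
    return data;
}

// Replace one slot with a candidate implementation and compare the full
// system output against the all-abstract reference run.
bool matches_reference(std::size_t slot, const Block& candidate,
                       std::vector<Block> system, const Frame& stimulus) {
    assert(slot < system.size());
    const Frame golden = run_system(system, stimulus);
    system[slot] = candidate;
    return run_system(system, stimulus) == golden;
}
```

With a system of {offset_abstract, scale_abstract, clip_abstract}, calling matches_reference(1, scale_rtl_stub, system, stimulus) exercises the replacement block in place, with the same stimulus and the same comparison used for the abstract model.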
Consider the picture above. When simulating this version of the design, all the green blocks will still be in abstract SystemC, but the orange block will be in Verilog RTL. This is the block being debugged “in place.” You can use exactly the same testbench every time. This kind of project management is very productive, but it does require that you:
Architectural exploration is one of the most fundamental values of applying HLS to CODEC designs. HLS allows a designer to generate multiple functionally equivalent RTL implementations from a single C++ source very easily. You can vary many different implementation parameters to build multiple micro-architectures.
It is important to build the system model early and to include a testbench that is sufficiently capable. Keeping the coding style of the design modules as abstract as possible lets you use high-performance system simulation for as many simulation runs as possible. This allows you to run many simulations up front, both to prove the correctness of your design and to test its performance.
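To make "functionally equivalent, different micro-architecture" concrete, here is a minimal sketch using a sum of absolute differences (SAD), a kernel commonly found in video motion estimation. The LANES template parameter is a hypothetical stand-in for an implementation knob; in an actual HLS flow such knobs are typically set through tool directives or constraints rather than in the source, and the cycle count here is a simple model, not a synthesis result.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct SadResult {
    std::uint32_t sad;   // functional result
    std::size_t cycles;  // modeled cycle count for this configuration
};

// LANES models how many absolute-difference units operate in parallel.
template <std::size_t LANES, std::size_t N>
SadResult sad(const std::array<std::uint8_t, N>& a,
              const std::array<std::uint8_t, N>& b) {
    static_assert(LANES > 0 && N % LANES == 0, "LANES must divide N");
    SadResult r{0, 0};
    for (std::size_t i = 0; i < N; i += LANES) {
        for (std::size_t l = 0; l < LANES; ++l) {
            const int d = static_cast<int>(a[i + l]) - static_cast<int>(b[i + l]);
            r.sad += static_cast<std::uint32_t>(d < 0 ? -d : d);
        }
        ++r.cycles;  // one modeled cycle per group of LANES pixels
    }
    return r;
}
```

For two 8-pixel rows, sad<1>(a, b) and sad<4>(a, b) return the same sad value, but the first models 8 cycles and the second models 2. The testbench can therefore check every configuration against the same expected result while the cost metrics vary.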
Having a testbench that is complete enough to pick up errors is very important at this point, since it will allow you to perform architectural exploration with confidence once you start running HLS and getting RTL results. Imagine being able to trivially trade off storage choices like single vs. dual-port memories vs. register banks! Using HLS to generate RTL and pushing that RTL through your implementation flow gives you highly accurate information to use in evaluating the area / power / performance trade-offs. The presence of a quality testbench is important during this phase so that you can effectively measure the performance of the resulting design as well as ensure its correctness (i.e., you didn't break it during experimentation).
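A first-order way to reason about the storage trade-off before running any tools is to count how many reads each storage type can serve per cycle. The sketch below does only that; the port counts are the usual assumptions (one read per cycle for a single-port memory, two for a dual-port memory, all elements at once for a register bank), the names are hypothetical, and it ignores area and power entirely. Those numbers have to come from the HLS and implementation flow itself.

```cpp
#include <cassert>
#include <cstddef>

enum class Storage { SinglePortRam, DualPortRam, RegisterBank };

// Assumed number of reads the storage can serve in one clock cycle.
inline std::size_t reads_per_cycle(Storage s, std::size_t taps) {
    switch (s) {
        case Storage::SinglePortRam: return 1;
        case Storage::DualPortRam:   return 2;
        case Storage::RegisterBank:  return taps;  // every element readable at once
    }
    return 1;  // not reached for valid enum values
}

// Cycles needed to fetch `taps` operands for one output sample.
inline std::size_t cycles_per_output(Storage s, std::size_t taps) {
    if (taps == 0) return 0;
    const std::size_t ports = reads_per_cycle(s, taps);
    assert(ports > 0);
    return (taps + ports - 1) / ports;  // ceiling division
}
```

For an 8-tap filter this model gives 8 cycles per output with a single-port memory, 4 with a dual-port memory, and 1 with a register bank. That is the throughput side of the trade-off, which the generated RTL then lets you weigh against area and power.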
To maximize your productivity, you will also need a fully automated flow that runs the downstream tools to evaluate the results of your exploration. This includes tools that report area, power, timing, and congestion. Making all of these flows completely push-button dramatically improves your visibility into the implications of each design decision. It also lets you see the most likely future of your design as it progresses through to production (we call this “seeing around corners”).
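Once an automated flow has produced metrics for each explored configuration, one simple way to narrow the candidates is to keep only the points that no other point beats on every metric (a Pareto filter). The sketch below assumes three metrics where lower is better. The DesignPoint structure and its fields are hypothetical, and any numbers fed to it would be whatever your own flow reports.

```cpp
#include <string>
#include <vector>

// One explored configuration with its reported metrics (lower is better).
struct DesignPoint {
    std::string name;
    double area;
    double power;
    double latency;
};

// True if `a` is at least as good as `b` on every metric and strictly
// better on at least one.
inline bool dominates(const DesignPoint& a, const DesignPoint& b) {
    const bool no_worse = a.area <= b.area && a.power <= b.power &&
                          a.latency <= b.latency;
    const bool better = a.area < b.area || a.power < b.power ||
                        a.latency < b.latency;
    return no_worse && better;
}

// Keep only the points that no other point dominates, in input order.
inline std::vector<DesignPoint> pareto_front(const std::vector<DesignPoint>& pts) {
    std::vector<DesignPoint> front;
    for (const DesignPoint& p : pts) {
        bool dominated = false;
        for (const DesignPoint& q : pts) {
            if (dominates(q, p)) { dominated = true; break; }
        }
        if (!dominated) front.push_back(p);
    }
    return front;
}
```

The filter does not pick a winner. It only removes configurations that are strictly worse than some alternative, leaving the genuine area / power / performance trade-offs for the designer to judge.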
To summarize the key actions you should take when designing a CODEC with HLS and things you should look for in an HLS flow: