Genus and Innovus: Compus and iSpatial

17 Apr 2019 • 5 minute read

Yesterday I covered the first part of Chuck Alpert's presentation on the upcoming-any-day-now release of Genus (19.1 if you're counting). Today I'll dig into the details a bit more.

Compus

In the new release, there is a next-generation compiler called Compus (pronounced like compass, my favorite extinct EDA company). It very aggressively flattens the levels of logic. As an example, Chuck showed a huge priority encoder that might end up with 92,000 instances at elaboration but can be optimized down to 29,000.

Compus aggressively uses multiple CPUs to try different architectures and so guides the synthesis process to the correct microarchitecture, especially opt. This is analogous to what the user might do in prior generations of synthesis, manually getting the tool to try different ideas and picking the one that looked the most promising for doing all the detailed optimization.

The engine is driven by TNS over the whole design. TNS sensitivity tends to be high for datapath elements due to the balanced structure and big bus widths (if you have negative slack on some stage of a 64-bit datapath, you probably have 64 times the negative slack in total). But you can't fix one path at a time in a datapath. The example Chuck talked about was picking a carry-lookahead adder versus a ripple-carry adder. This is something that can only be done early in the synthesis process, and no amount of optimization at the netlist level is going to turn a ripple-carry adder into a carry-lookahead adder one bit at a time to make timing. Timing is not considered a hard constraint initially. The algorithm is biased more toward timing when it is negative (need to work harder to make timing) but towards area when it is positive (there is headroom for a slower design).
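To make the TNS arithmetic concrete, here is a minimal Python sketch. It is not how Genus works internally: the `AdderCandidate` class, the area and delay numbers, and the lexicographic cost in `pick_architecture` are all made up for illustration. It shows why a balanced 64-bit datapath multiplies a single failing stage into a large TNS, and it gives a toy version of "favor timing when slack is negative, favor area when there is headroom."

```python
from dataclasses import dataclass


def wns(slacks_ps):
    """Worst negative slack: the most negative endpoint slack, or 0 if nothing fails."""
    return min(min(slacks_ps), 0)


def tns(slacks_ps):
    """Total negative slack: the sum of slack over failing endpoints only."""
    return sum(s for s in slacks_ps if s < 0)


@dataclass(frozen=True)
class AdderCandidate:
    """Hypothetical architecture choice; area and delay are invented numbers."""
    name: str
    area: int       # arbitrary units
    delay_ps: int   # worst-case delay through the adder


def datapath_slacks(candidate, required_ps, width):
    """Simplification: in a balanced datapath every bit sees the same slack."""
    return [required_ps - candidate.delay_ps] * width


def pick_architecture(candidates, required_ps, width):
    """Toy cost: minimize the TNS violation first, then area among equals.

    When no candidate fails timing, every violation is 0 and area decides.
    """
    def cost(candidate):
        violation = -tns(datapath_slacks(candidate, required_ps, width))
        return (violation, candidate.area)

    return min(candidates, key=cost)


# Assumed numbers, for illustration only.
RIPPLE = AdderCandidate("ripple-carry", area=64, delay_ps=640)
LOOKAHEAD = AdderCandidate("carry-lookahead", area=160, delay_ps=240)

if __name__ == "__main__":
    failing_bus = [-50] * 64  # every bit of a 64-bit stage misses by 50 ps
    print("WNS:", wns(failing_bus), "TNS:", tns(failing_bus))
    print("tight clock:", pick_architecture([RIPPLE, LOOKAHEAD], 300, 64).name)
    print("loose clock:", pick_architecture([RIPPLE, LOOKAHEAD], 1000, 64).name)
```

With these assumed numbers, a 64-bit stage missing by 50 ps has a WNS of only -50 but a TNS of -3200, which is why a TNS-driven engine pays so much attention to datapath. The toy selector picks the carry-lookahead adder under the tight clock and falls back to the smaller ripple-carry adder once there is headroom.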

The above table shows the results of Compus (comparing Genus 18.1 with the new 19.1). All the designs are improved, sometimes significantly (well, okay, one is nearly 1% worse, and I can tell you that having run engineering in a synthesis company, that design that is worse will always be the one from your most important customer).

Critical Region Restructuring

Here's how synthesis works under the hood. The RTL is read in and turned into a control-dataflow graph (don't worry if you don't know what that is), which is then turned into generic gates. These are often nowhere close to anything that actually exists in the library: a 64-bit AND gate, for example. The generic gates then go through structuring and mapping to get to actual library elements. Then placement and optimization are done, where sizing and buffering are the only optimizations (so a NAND gate might be switched for one with higher drive strength, or additional buffers might be added to a critical path).
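As a rough illustration of the structuring step, the sketch below breaks a wide generic AND into a balanced tree of gates with a limited number of inputs. The function name `map_wide_and` and the fan-in limits are my own assumptions, not anything from Genus, and a real mapper weighs many more factors (inverted outputs, drive strengths, arrival times).

```python
def map_wide_and(num_inputs, max_fanin):
    """Return (gate_count, levels) for a balanced tree of AND gates.

    Each gate takes at most max_fanin inputs. A leftover single signal at
    any level is passed up to the next level without spending a gate on it.
    """
    if num_inputs < 1 or max_fanin < 2:
        raise ValueError("need at least one input and a fan-in of two or more")

    gates = 0
    levels = 0
    signals = num_inputs
    while signals > 1:
        full_groups, leftover = divmod(signals, max_fanin)
        # A leftover group of two or more signals needs its own (smaller) gate.
        gates += full_groups + (1 if leftover > 1 else 0)
        # Every group, including a lone leftover signal, feeds the next level.
        signals = full_groups + (1 if leftover > 0 else 0)
        levels += 1
    return gates, levels


if __name__ == "__main__":
    for fanin in (2, 4):
        count, depth = map_wide_and(64, fanin)
        print(f"64-input AND with {fanin}-input gates: {count} gates, {depth} levels")
```

Under this simple model, the generic 64-input AND becomes 63 two-input gates in 6 levels, or 21 four-input gates in 3 levels. That is the kind of structural choice that sizing and buffering alone cannot revisit later.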

The old way of handling a path that fails timing after physical design is to go back to RTL, over-constrain the failing path (so that synthesis produces a netlist with headroom on that path), then redo placement and hope it makes the real timing (and hope that nothing else bad happens, which sometimes it does).

Critical Region Restructuring (CRR) further breaks down the wall between synthesis and placement. Instead of having to go all the way back to RTL, a group of gates (a region) can be aggressively restructured. Genus goes back to the generic gates, restructures in the post-placement timing environment, and then does incremental placement, sizing, and buffering.

CRR significantly improves TNS of difficult PPA designs. It also improves the run time even for many designs that met timing before. The diagram below shows a simple example, where (on the left) a generic AND gate is mapped, but one of the input signals is arriving late. This is not something that can be handled with resizing and buffering, but the region can be restructured (on the right) so that the late signal gets a special fast lane through the region so that it makes timing.
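The late-arriving-input case can be sketched with a toy delay model. Assume every two-input AND gate costs the same fixed delay and ignore wires entirely (both are simplifications of mine, not part of the presentation). A balanced tree treats all inputs alike, while a greedy restructuring that always combines the two earliest-arriving signals first leaves the late signal close to the output. This greedy is a textbook-style stand-in, not a description of the CRR algorithm itself.

```python
import heapq


def balanced_arrival(arrivals_ps, gate_delay_ps):
    """Output arrival time of a balanced tree of 2-input gates, inputs taken in order.

    arrivals_ps must be non-empty.
    """
    level = list(arrivals_ps)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(max(level[i], level[i + 1]) + gate_delay_ps)
        if len(level) % 2:
            next_level.append(level[-1])  # odd signal passes up unchanged
        level = next_level
    return level[0]


def restructured_arrival(arrivals_ps, gate_delay_ps):
    """Output arrival time when the two earliest signals are always combined first.

    arrivals_ps must be non-empty.
    """
    heap = list(arrivals_ps)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + gate_delay_ps)
    return heap[0]


if __name__ == "__main__":
    # Hypothetical 8-input AND: one input arrives 250 ps late, gates cost 100 ps.
    arrivals = [250, 0, 0, 0, 0, 0, 0, 0]
    print("balanced:    ", balanced_arrival(arrivals, 100))
    print("restructured:", restructured_arrival(arrivals, 100))
```

With these made-up numbers the balanced tree produces its output at 550 ps, while the restructured one gets there at 400 ps, because the late input only has to pass through the final gate. No amount of upsizing the gates in the balanced tree changes the fact that the late signal sits three levels deep.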

iSpatial: Together at Last

The final step of the "together at last" flow is to let CRR into Innovus. When Innovus cannot make timing, instead of limiting it to buffering and resizing, the synthesis engine of Genus can be used to do CRR during physical design. Since all the engines are common, there is no need for Innovus to do a placement from scratch; it can start from the Genus placement. Of course, there are still things that might happen later during the co-optimization of the design with the clock tree in CCOpt (clock concurrent optimization), but CRR is still available under the hood to address those issues.

This makes a big difference to both results (more designs meet timing) and to runtime (even designs which met timing before now get there faster). The graph on the left below shows the difference between using the old PlaceOpt Design flow (in green) and iSpatial (in blue). iSpatial converges to the required timing much faster. The bar graph on the right shows just how much faster, an average of about 3X. The blue bars are iSpatial and the pink bars are full PlaceOpt.

The quality of results is better too, as shown in the table below:

Summary

This partial merging of the capabilities of Genus and Innovus, with optimized placement in Genus and restructuring of logic in Innovus, brings us close to the holy grail of the digital flow. The fundamental problem is to get an accurate view during synthesis of what the final result will be after running place and route...without having to actually run place and route.

Synthesis always talks about QoR, for quality of results, but in fact, all that counts in the end is what the quality of results is after place and route. It is easy to produce results which look good when they come out of synthesis, only to find out that the netlist is unroutable or has some other major problem. A high clock-frequency in Genus counts for nothing if Innovus cannot deliver it. On the other hand, no design group wants to leave performance on the table by pessimistically guard-banding everything during synthesis just to make life easier for the physical design team.

As Chuck said at one point in his presentation:

if you define hundreds of regions, then that's the placement that you're going to get. Much better to give the tools the freedom it needs to optimize.

There's more in the new release, in particular deeper integration with Joules for power-aware synthesis. But since I've already used two posts to cover Chuck's presentation (and this is the third presentation on synthesis this week), we'll leave power for another time.
