Under the Hood of Genus

25 Jun 2020 • 8 minute read

From time to time, people ask how EDA tools work under the hood. I think the question is more along the lines of "what is all that stuff in the front of my car?" than "how does ignition advance work?"

So here's what's under the hood of Genus (no ignition timing included).

I thought I'd start with Genus Synthesis since I ran engineering for a year at Ambit so I have better working knowledge than with most other tools. But that was over twenty years ago, and quite a bit has changed since then. The cutting edge of what we were working on was called PKS, which stood for "physically knowledgeable synthesis". Up until that point, timing in synthesis had been estimated using wire-load-models. A wire-load-model provided an estimate of the delay on the net based mostly on the fanout of the net itself, and the overall physical size of the block being synthesized. When blocks were small, and most delay was due to capacitance (as opposed to resistance), this worked surprisingly well.
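As a rough illustration of the idea, a wire-load model can be thought of as a lookup table keyed by block size and fanout. The sketch below is only that: the table names, the capacitance values, and the linear extrapolation beyond the table are all made-up assumptions, not data from any real library.

```python
# Hypothetical wire-load tables: fanout -> estimated wire capacitance (fF).
# All numbers here are invented for illustration.
WIRE_LOAD = {
    "small_block": {1: 2, 2: 4, 3: 7, 4: 10},
    "large_block": {1: 5, 2: 11, 3: 18, 4: 26},
}

# Hypothetical extra capacitance per additional fanout beyond the table (fF).
SLOPE = {"small_block": 3, "large_block": 8}


def estimated_wire_cap(block, fanout):
    """Estimate net capacitance from block size and fanout (fanout >= 1).

    Note what is missing: nothing here knows where the cells actually are.
    """
    table = WIRE_LOAD[block]
    top = max(table)
    if fanout <= top:
        return table[fanout]
    # Past the end of the table, extrapolate linearly.
    return table[top] + SLOPE[block] * (fanout - top)


print(estimated_wire_cap("small_block", 3))
print(estimated_wire_cap("large_block", 6))
```

Every net with the same fanout in the same block gets the same estimate, which is exactly the limitation that placement-aware synthesis was meant to remove.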

But the obvious thing being left out was where the cells were placed, and so how long the various wire segments were. When I was presenting PKS to customers, I would use the analogy that wire-load-models were like trying to estimate the flying time to visit a certain number of cities in the US, without knowing which cities they were. When blocks were small, it was like knowing the cities were in the Bay Area, so there was a limit to how far you could go wrong. But as blocks got bigger, it was more like having the whole of North America, and it made a big difference whether you were looking at Seattle, Portland, and Vancouver, or at Seattle, Boston, and Houston. This is not quite the famous traveling salesman problem, since you typically route with a Steiner tree (and on a chip you are restricted to horizontal and vertical wires).
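To put numbers on the cities analogy, here is a minimal sketch using half-perimeter wirelength (HPWL), a common placement-time estimate of net length when wires run only horizontally and vertically. For two- and three-pin nets it equals the length of the shortest rectilinear Steiner tree; for larger nets it is a lower bound. The pin coordinates are hypothetical, and nothing here is claimed to be the estimate Genus itself uses.

```python
def hpwl(pins):
    """Half-perimeter wirelength of the bounding box around a net's pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


# Two hypothetical 3-pin nets: same fanout, very different spread.
clustered = [(0, 0), (2, 1), (1, 3)]    # "Bay Area"
spread = [(0, 0), (40, 5), (10, 30)]    # "Seattle, Boston, and Houston"

print(hpwl(clustered))  # 5
print(hpwl(spread))     # 70
```

A fanout-based model would give both nets the same delay estimate, even though one is more than ten times longer than the other.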

If you have heard any Cadence presentation about the digital flow, you will have heard the theme that the flow has common engines: a single timing engine, a single extractor, a single placement engine, a single global router, and so on. Timing is only a very rough estimate without placement, good for tuning up the RTL but inadequate when doing physical layout and signoff. When at Ambit we were building PKS, we didn't have a physical design system or even a placer. We had to create one from scratch. But once Cadence acquired Ambit, it made no sense to develop a second placement engine—much better to use the Cadence placement engine within the synthesis tool. And Cadence adopted the Ambit timing engine.

Rather than rely on my 20-year-old knowledge, I talked to Chuck Alpert, who heads up the engineering for the Genus Synthesis product.

Elaboration

Synthesis starts from RTL, of course. These days, that is usually SystemVerilog, although Verilog and VHDL are still used, mostly for legacy IP. The first step of synthesis is called "elaboration". The RTL is actually partitioned into control and dataflow (known as CDFG, for control/dataflow graph). This is then traversed and the first actual synthesis is done into "generic gates". These are generic, in the sense that at this point the actual process and library that will be used are not considered. These gates are as generic as their name implies: "a 3-input NAND gate" or "a D-type flipflop". Chuck thought that the elaboration was largely from the Ambit synthesis product, although obviously much enhanced.

The reason for the generic gates is that a lot of optimization can be done at this level, and a lot more easily than having to worry about all the idiosyncrasies of the actual gates. If you went through an old-school gate-level design class, which I suspect doesn't exist anymore, then you will have learned how to do some of that optimization in the Quine-McCluskey method (in a little coincidence for this post, McCluskey was on the Technical Advisory Board of Ambit). Genus Synthesis can also do the first phase of clock-gating on the generic netlist.

For designs that make a lot of use of datapaths and arithmetic, a lot of datapath optimization can be done, switching adders to carry-save-adders when performance is required. For more on adders, see the third part of my series on implementing carry, going back to adding machines: Carry: Electronics. There can be a lot of word-level optimization done, sometimes crunching the logic down by as much as 10X.

Next, mux (multiplexor) optimization can be done, especially for state-machines. Old-school state-machine optimization considered the flipflops as very expensive and would try and use the minimum, at the expense of lots of gates that essentially encoded and decoded the state. These days, flipflops are not that much more expensive than gates, so "one-hot" is more often used, with a flop for each state (at least if the number of states is reasonable).
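The flipflop trade-off between the two encodings is easy to count. This is a small sketch with hypothetical helper names: a minimum (binary) encoding needs about log2 of the number of states, while one-hot needs one flop per state but makes "are we in state i?" a single-bit test.

```python
def flops_binary(num_states):
    """Flipflops needed for a minimum-width binary state encoding."""
    return max(1, (num_states - 1).bit_length())


def flops_one_hot(num_states):
    """One-hot uses one flipflop per state."""
    return num_states


def one_hot_code(state_index):
    """One-hot code for a state: exactly one bit set."""
    return 1 << state_index


for n in (5, 16, 200):
    print(n, flops_binary(n), flops_one_hot(n))
```

For 5 states that is 3 flops versus 5, which is cheap. For 200 states it is 8 versus 200, which is why the "if the number of states is reasonable" caveat matters.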

Structuring

Next comes what Chuck calls "traditional academic BDD type synthesis". BDD stands for "binary decision diagram" and is an approach used to optimize at the equation level (a sort of supercharged Quine-McCluskey or Espresso, if you know what either of those are). This is optimization to reduce the Boolean literal count and get a netlist that is close enough that it is worth placing.

Placement

The early physical flow begins with generic placement. This is still using generic gates, but now they are placed so that we start to get more accurate timing. It is obvious once you think about it, but once you mix physical and logical, you have two ways to impact timing of a little piece of the circuit: you can change the details of the gates or you can move the gates.

There is another phase of structuring done, since having the locations of the gates provisionally set by placement gives a lot of additional information on how best to structure large elements. For example, if you have a 64-input NAND gate (to see if all the bits are 1, for example) then how you split that up into smaller gates will depend on where its inputs have ended up.
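A toy version of placement-aware structuring might look like the following. It assumes a one-dimensional placement (a single x coordinate per signal), a maximum fan-in of 4, and a simple "group neighbours together" rule. All of these are illustrative assumptions, not a description of how Genus actually decomposes wide gates.

```python
def build_tree(inputs, max_fanin=4):
    """Split one wide gate into a tree of smaller gates, grouping neighbours.

    inputs: list of (name, x) pairs. Returns (gate_count, levels).
    """
    level = sorted(inputs, key=lambda p: p[1])  # neighbours end up adjacent
    gates = 0
    levels = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), max_fanin):
            group = level[i:i + max_fanin]
            if len(group) == 1:
                next_level.append(group[0])  # lone signal passes up unchanged
                continue
            gates += 1
            # Put the new gate at the average position of its inputs.
            x = sum(p[1] for p in group) / len(group)
            next_level.append((f"g{gates}", x))
        level = next_level
        levels += 1
    return gates, levels


# 64 hypothetical input signals spread along a row.
signals = [(f"bit{i}", i * 10) for i in range(64)]
print(build_tree(signals))  # (21, 3)
```

With 64 inputs and a fan-in of 4, this gives 16 + 4 + 1 = 21 gates in 3 levels. Which inputs share a first-level gate depends entirely on where placement put them.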

The next step is called "mapping". Surprisingly, up until this point, the actual standard cells in the library (or libraries) being used didn't get considered. Everything has still been in generic gates. Mapping goes over the generic gates and finds library cells to implement small pieces of the netlist. The generic gates might all have been NAND gates, but perhaps it is better to use a complex AND-OR-INVERT gate if it is available.

Optimization

After some cleanup, Genus Synthesis enters an optimization phase. This can be one of the slowest parts of synthesis. Genus Synthesis "tries to make timing" by looking for paths that are missing timing and then considering an incremental change. If the incremental change improves timing, then it is accepted. If it does not, then it is discarded and a different incremental change is considered. For paths that make timing, Genus Synthesis considers changes that reduce cost (pick smaller cells) without making timing paths go negative.
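The accept-or-discard loop can be sketched on a single path. Everything here is a simplification: the path is just a list of cell delays, the only "incremental change" is swapping one cell for an alternative with a different delay, and the numbers are hypothetical. A real optimizer considers many more kinds of move, across many paths at once.

```python
def slack(delays, required):
    """Slack of a path: required time minus the sum of cell delays."""
    return required - sum(delays)


def optimize(delays, options, required):
    """Greedy timing fix-up for one path.

    delays:  current delay of each cell on the path
    options: for each cell, a list of alternative delays to try
    """
    delays = list(delays)
    for i, alternatives in enumerate(options):
        for alt in alternatives:
            if slack(delays, required) >= 0:
                return delays  # path already makes timing
            trial = delays[:i] + [alt] + delays[i + 1:]
            if slack(trial, required) > slack(delays, required):
                delays = trial  # change improves timing: accept it
            # otherwise the change is discarded and the next one is tried
    return delays


path = [30, 40, 50]                       # hypothetical delays, in ps
choices = [[25, 35], [30], [55, 45]]      # hypothetical alternative cells
fixed = optimize(path, choices, required=100)
print(fixed, slack(fixed, 100))  # [25, 30, 45] 0
```

The path starts 20 ps short. Three of the five candidate changes are accepted and two are discarded, leaving the path exactly on its constraint.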

If you have any experience of synthesis, you probably know the initials TNS and WNS. Slack is the margin on a timing path: positive slack means there is time to spare, zero means timing is dead on the constraint, and negative slack means the path misses timing. TNS stands for "total negative slack" and is the sum of how much all the failing paths in the design miss timing. WNS is "worst negative slack" and is the amount by which the worst path misses. There is a lot of detail to be considered, but those two numbers capture how well the synthesis did in just a couple of parameters.
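As a small worked example, here is how the two numbers fall out of a list of per-path slacks. The slack values are hypothetical, and reporting WNS as 0 when nothing fails is an assumption of this sketch (tools differ in what they report for a design that meets timing).

```python
def wns_tns(slacks):
    """Return (worst negative slack, total negative slack) for a list of slacks."""
    negative = [s for s in slacks if s < 0]
    wns = min(negative, default=0)  # assumption: 0 when no path fails
    tns = sum(negative)
    return wns, tns


# Hypothetical path slacks in picoseconds.
slacks = [12, -5, 0, -30, 7]
print(wns_tns(slacks))  # (-30, -35)
```

Two paths fail here: the worst misses by 30 ps, and together they miss by 35 ps. Paths with zero or positive slack contribute nothing to either number.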

iSpatial

The iSpatial tool blends synthesis (Genus Synthesis) and place and route (Innovus Implementation) so that both tools can take advantage of deeper integration with the other. Actually it blends the placement and optimization of Genus Synthesis with global routing and placement from Innovus Implementation. For a deeper dive into iSpatial, see my two posts on Chuck's CDNLive presentation last year: Genus and Innovus: Together at Last and Genus and Innovus: Compus and iSpatial.

A Deeper Dive

There is a lot of raw computation going on during synthesis, especially if the design is large. Placement is treated as an optimization problem in the GigaPlace engine. To make this mathematically tractable, the timing model needs to be smoothed (so that the process doesn't keep getting stuck in local minima). So the timing model is much more of a "fancy mathematical thing", in Chuck's words. The placer looks at the density and the timing, and then iterates to get new timing and density. This is used in iSpatial towards the end of synthesis. Even deeper down, this is using matrix solvers, like so much of EDA. When this engine was first built, it was not very fast, but over the years it has been improved by an order of magnitude.

In a modern process, the resistance on the various layers of interconnect is very different. So during generic placement, in the middle of synthesis, it is not enough to just do placement; the signal layers (and other details) need to be assigned to get good timing values. This process is called global routing (since it cuts a lot of corners that a later router, the detailed router, will have to clean up).

Learn More

Well, this is about the internals of a Cadence tool, so you aren't going to learn much more anywhere, except perhaps at conferences like DAC and DATE when we present. But here is the Genus Product Brief that is a reasonably deep dive.

