
Paul McLellan


Computational Digital Software

24 Jun 2020 • 6 minute read

Cadence has been using the term "computational software" to unify many of the algorithms that underlie EDA tools. I think the area where this is clearest is the digital full flow, which runs from synthesis, through physical design, to signoff. Cloud vendors sometimes casually say things like "run for a week", but they seem surprised when EDA algorithms literally take a week (or longer), perhaps using 32 or more CPUs for the duration. Machine learning on the largest datasets may take similar periods, but those algorithms have improved a lot: training ResNet-50 (a standard neural net trained on ImageNet) takes only 15 minutes, admittedly on 32 CPUs with 32 GPUs. The ImageNet Large Scale Visual Recognition Challenge contains 1,281,167 images for training, 50,000 for validation, and 100,000 for testing. By EDA standards, these are small numbers.

When you read that a chip has tens of billions of transistors, it is not quite as challenging as it sounds. Firstly, a gate is usually four transistors, and most standard cells contain multiple gates, so the number of "placeable objects" is an order of magnitude smaller. Most SoCs contain significant amounts of memory, which is generated by a memory compiler. The biggest chips of all, GPUs and multicore CPUs, have a lot of regularity: a core is created and then reproduced to fill the chip. Nonetheless, taking an SoC from RTL to complete layout is one of the most computationally intense tasks in all of EDA. We are still dealing with perhaps hundreds of macro blocks and tens (or hundreds) of millions of standard cells.

Digital Full Flow

There are multiple core engines involved in the digital flow: synthesis, placement, clock tree, global routing, detailed routing, timing, extraction, and power. Each needs to be best in class to get a best-in-class result. In the dim and distant past, these engines would run one after the other (synthesis, then placement, then clock tree, and so on). That is not accurate enough for all the physical details that need to be correct in a modern leading-edge process at 16nm and below. There also needs to be world-class integration so that, for example, placement can be taken into account and partially completed during synthesis, or synthesis can be used during physical design to restructure logic.
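The difference between the old stage-after-stage flow and an integrated one can be sketched abstractly. The stage names and feedback hooks below are hypothetical, purely to show the shape of the iteration, not actual tool interfaces:

```python
# Illustrative sketch: an integrated flow feeds later-stage estimates back
# into earlier stages, instead of running each engine strictly once.
# Function names and arguments are hypothetical, not real tool APIs.

def synthesize(rtl, placement_estimate=None):
    # In an integrated flow, synthesis can consult an early placement
    # estimate when choosing gate sizes and restructuring logic.
    detail = "placement-aware" if placement_estimate else "wireload-based"
    return f"netlist({rtl}, {detail})"

def estimate_placement(netlist):
    # Stand-in for a quick placement pass over the current netlist.
    return f"rough_placement({netlist})"

# Classical flow: one pass per engine, no feedback.
netlist = synthesize("cpu_core.v")

# Integrated flow: iterate, feeding placement data back into synthesis.
for _ in range(2):
    placement = estimate_placement(netlist)
    netlist = synthesize("cpu_core.v", placement_estimate=placement)
```

The point of the sketch is only the loop at the bottom: physical information flows backward into synthesis rather than being discovered too late.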

Integration between synthesis, placement, and optimization is so important that we call this iSpatial technology. Interconnect layers in a modern process all have very different capacitance and resistance profiles, so deciding how to use them needs to be done early, which involves techniques such as layer assignment, useful clock skew, and via pillars. The iSpatial technology allows a seamless transition from Genus physical synthesis to Innovus implementation using a common user interface and database.

When a design runs for a week, runtime matters enormously: a 10% saving recovers almost 17 hours. A saving like this can come from many sources: better scalability into the cloud (being able to use 48 machines), better core algorithms (a faster placer), or better integration (say, between the placement engine and the timing engine).
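The arithmetic behind that claim is simple enough to spell out:

```python
# What a 10% speedup buys back on a week-long wall-clock run.
week_hours = 7 * 24       # 168 hours of runtime
saved = week_hours / 10   # a 10% runtime improvement
print(saved)              # 16.8 hours recovered, most of a working day
```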

Over the last several years, Cadence has created a digital full flow with common integrated engines, resulting in technology leadership. Of course, development is never over, but this gives us the platform for the future. Advanced nodes (5nm, 3nm, and beyond) remain very important to our most advanced customers. Another trend is advanced packaging, often called More than Moore. Integrating everything onto the most advanced node is often not the most cost-effective approach, and it makes a lot of sense to use the most advanced node only for the heart of the SoC that requires it, and a less aggressive node to interface to the outside world. Not only does this approach yield better, but it can also result in faster time-to-market, since the current chip can reuse the I/O designs from the previous process generation, and the current-generation I/Os can be developed off the critical path, ready for the next-generation chip perhaps a year later.

Machine Learning

Another trend is using machine learning (ML) to improve further. One of the limiting factors in any advanced SoC design is the availability of enough designers. In some ways, running EDA tools has something in common with running a nuclear power plant: a lot of it is very repetitive, but you need absolutely the best engineers. ML allows computation, especially in the cloud, to substitute for routine human interaction, delivering a major increase in productivity. Just as the goal for cars is autonomous driving, reached through incremental steps, the goal for the digital full flow is eventual automation, sometimes called "no human in the loop". But, as with cars, we have to approach it incrementally; perhaps we can call that "fewer humans in the loop". When an engineer runs a tool, looks at the result, and then tweaks some parameters before running the flow again, there are many opportunities to tweak those parameters automatically, at least some of the time.
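The "run, inspect, tweak, rerun" loop can be framed as an automated search. Everything below — the parameter names, the toy scoring function, even random search as the strategy — is a generic illustration, not a description of how any Cadence tool works:

```python
import random

# Generic sketch of automating the run/inspect/tweak/rerun loop.
# Parameter names and the scoring function are invented for illustration.

def run_flow(params):
    # Stand-in for a full tool run returning a quality score to maximize
    # (in reality something like timing slack or congestion); here a toy
    # function whose best settings are effort=0.7, density=0.6.
    return -(params["effort"] - 0.7) ** 2 - (params["density"] - 0.6) ** 2

random.seed(0)
best_params, best_score = None, float("-inf")
for _ in range(50):  # far more trials than a human would sit through
    params = {"effort": random.random(), "density": random.random()}
    score = run_flow(params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params)  # settings near effort=0.7, density=0.6
```

Real systems use far smarter search than this, but the structure is the same: the machine, not the engineer, closes the loop.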

There is also scope for improving predictability. At some level, each tool in the flow has a standard of "goodness" that is tied up in how well it integrates with the next stage. A good placement is one that the global router handles well. A good global route is one that the detail route can handle well, and so on. Almost all the algorithms in EDA are computationally intractable, in the sense that getting the exact optimal solution is not possible. Instead, heuristics are used. But this is another area where ML can be used, with heuristics at one level using ML to better predict how things will be downstream.
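A minimal sketch of that idea, with entirely synthetic data and an invented feature (pin density predicting post-route detour): fit a model on past runs, then rank candidate placements by predicted downstream quality without routing them.

```python
# Toy "predict downstream from upstream": fit a one-feature linear model
# on synthetic history, then rank candidate placements before routing.
# Feature, metric, and numbers are all invented for illustration.

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b (closed form, one feature).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic history: post-place pin density vs. post-route detour length.
pin_density = [0.2, 0.4, 0.5, 0.7, 0.9]
detour      = [1.0, 1.9, 2.6, 3.5, 4.4]

a, b = fit_line(pin_density, detour)

# Rank two candidate placements by predicted routing pain, without routing.
candidates = {"placement_A": 0.3, "placement_B": 0.8}
predicted = {name: a * x + b for name, x in candidates.items()}
best = min(predicted, key=predicted.get)
print(best)  # placement_A: lower pin density, lower predicted detour
```

Production flows use much richer features and models, but this is the heuristic-plus-prediction pattern: the upstream engine consults a learned estimate of what the downstream engine will do.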

Better prediction leads to a better result and, at least potentially, a faster runtime, since less iteration is required. Diagrams of a flow always look much more linear than they really are, since there is so much iteration going on under the hood. It is like the famous description of a swan as serene on top but paddling furiously underneath. EDA tools often run smoothly from start to finish, but it takes a lot of furious paddling to make everything that smooth.

Summary

The Cadence digital full flow contains best-in-class engines, together with best-in-class integration to make everything work cleanly together. Adding machine learning improves productivity even more. You need best-in-class digital technology to create and sign off best-in-class results.

MediaTek

One happy user is Dr. SA Hwang, the GM of the computing and artificial intelligence technology group at MediaTek. His experience is:

We spend a significant effort tuning our high-performance cores to meet our aggressive performance goals. Using the new ML capabilities in the Innovus Implementation System’s GigaOpt Optimizer, we were able to automatically and quickly train a model of our CPU core, which resulted in an improved maximum frequency along with an 80% reduction in total negative slack. This enabled 2X shorter turnaround time for final signoff design closure.

Having run an engineering group developing synthesis tools, albeit 20 years ago, I have to say that a reduction of 80% in TNS is huge. Obviously, halving runtime is also very significant.

Arm

At Arm TechCon last year there was a joint presentation by Arm, Cadence, and Samsung about implementing a big Arm processor in Samsung's 5nm process. I wrote about that in my post Implementing Arm Hercules with Digital Full Flow.

Computational Software Video

Watch this short (3½ minute) video on computational software for Intelligent System Design.

Learn More

For more information see the Digital Design and Signoff page.


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.