TI's Experience Taping out with Pegasus

10 Apr 2019 • 5 minute read

At the recent CDNLive Silicon Valley, Kyle Peavy of Texas Instruments (TI) presented "Cadence Pegasus Physical Verification: A Customer Tapeout Experience", along with Cadence's Digo (Dibyendu) Goswami.

Obviously, just based on the title, TI taped out their chip successfully. Kyle wasn't allowed to say how big the chip was, but he did say it had little repeated hierarchy. "It's not huge tiles of CPUs or anything like that." It is a 38GB gzipped GDSII.

Kyle started by emphasizing that during general DRC cleanup, most engineers do their best work during the day, so a 12-15 hour run time allows for overnight verification of a day's work. DRC violations are the most common new issue that pops up during chip closure, so quick feedback is critical. A lot of signoff tasks can be done in parallel, as shown in the diagram below, in what Kyle calls the "race to closure".

They found that run-time scaled pretty well with the number of CPUs. He did caution that this was running in a live datacenter shared compute environment, so there is some noise in the numbers since things can vary based on what else happens to be running at the time. The charts below are full-chip DRC (on the left) and antenna checks (on the right).

TI never ran the antenna checks on more than 40 CPUs, but Cadence continued to see good scaling when they ran on more. There are updates coming to LVS which should dramatically speed up antenna checks, where the bulk of the runtime is spent propagating the connectivity across the whole design (see the second half of this post for the details that Digo shared).

Compute Environment

The reality of life in a shared compute LSF environment is that there are many available hosts with 4 CPUs, and few with 16 CPUs. TI's experience was that the efficiency of Pegasus wasn't impacted when the job was spread across more hosts with fewer CPUs, at least for 4/8/16 CPUs. Even for hosts with just 2 CPUs, the overall runtime was only impacted 10-15%. Another great attribute of Pegasus is that it starts pre-processing the moment it gets the first CPU and then adds others immediately as they come online. In particular, it doesn't wait for hundreds of CPUs to be available before getting started. Pegasus also takes into account the amount of memory on hosts and can handle a mixture of smaller and larger memory hosts.
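To see why starting on the first available CPU matters, here is a minimal sketch in Python. It is a toy scheduling model, not Pegasus code: the function name, the task durations, and the CPU arrival times are all made up for illustration, and each task simply goes to whichever CPU frees up first.

```python
import heapq

def simulate_makespan(task_durations, cpu_arrival_times, wait_for_all=False):
    """Return the time at which the last task finishes (toy model).

    task_durations: run time of each independent task.
    cpu_arrival_times: time at which each CPU is granted by the scheduler.
    wait_for_all: if True, nothing starts until the last CPU has arrived.
    """
    if wait_for_all:
        start = max(cpu_arrival_times)
        free_at = [start] * len(cpu_arrival_times)
    else:
        # Elastic start: each CPU is usable from the moment it arrives.
        free_at = list(cpu_arrival_times)
    heapq.heapify(free_at)

    finish = 0.0
    for duration in task_durations:
        t = heapq.heappop(free_at)       # earliest CPU to become free
        end = t + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

tasks = [1] * 8                 # eight unit-length tasks (hypothetical)
arrivals = [0, 2, 4, 6]         # CPUs trickle in over time (hypothetical)
elastic = simulate_makespan(tasks, arrivals)
blocking = simulate_makespan(tasks, arrivals, wait_for_all=True)
```

In this toy case the elastic start finishes before the last CPU has even arrived, while the wait-for-everything policy only begins at that point.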

The interhost communication is direct via socket, and multiple runs on the same disk are fine. However, when TI tried using offsite compute resources together with local compute resources, they got poor results. This was true of many tools, not just Pegasus: network latency is important.

The TI use-model was to look at jobs in two classes:

Background Pegasus runs where the designer is not waiting on the results (such as early floorplan verification or incremental ECOs) are done with a small number of CPUs, reserving compute resources for other tools.
Critical Pegasus runs where the designer is waiting on results (such as the manual DRC closure process) are done scaling to a very large number of CPUs.

One area where Pegasus has advanced since it was first released is fault tolerance. It is the nature of shared compute farms that machines crash. Pegasus used to crash or hang the whole job when this happened. Now the run will continue. The limitation, for now, is that you lose the output of the tasks that the crashed CPU was working on, and have to manually restart them. In the future (2H19), the crashed jobs will be restarted automatically on healthy hosts.
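The planned behavior, restarting lost tasks on healthy hosts, can be sketched as follows. This is only an illustration of the general requeue idea, not how Pegasus implements it; the function name, the round-robin assignment, and the host and task names are all assumptions.

```python
from collections import deque

def run_with_requeue(tasks, workers, crashed):
    """Assign tasks round-robin, then rerun tasks lost on crashed workers.

    Returns (results, lost), where results maps each task to the worker
    that finally completed it and lost lists the tasks that needed a rerun.
    """
    healthy = [w for w in workers if w not in crashed]
    if not healthy:
        raise RuntimeError("no healthy workers left")

    pending = deque(tasks)
    results = {}
    lost = []

    # First pass: tasks that land on a crashed worker lose their output.
    i = 0
    while pending:
        task = pending.popleft()
        worker = workers[i % len(workers)]
        i += 1
        if worker in crashed:
            lost.append(task)
        else:
            results[task] = worker

    # Second pass: restart the lost tasks on healthy workers only.
    for j, task in enumerate(lost):
        results[task] = healthy[j % len(healthy)]
    return results, lost

results, lost = run_with_requeue(
    ["rule1", "rule2", "rule3", "rule4"],   # hypothetical task names
    ["hostA", "hostB"],                     # hypothetical hosts
    crashed={"hostB"},
)
```

Today's behavior corresponds to stopping after the first pass and handing the `lost` list back to the user to restart by hand.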

The progress of Pegasus on the DRC is also handled incrementally, with notification as each rule completes. If there are errors, then they are updated and can be viewed in the Pegasus UI even while the job is still running. Finally, near the end of the run, Pegasus reports not just what has completed but also which rules remain to be checked, so you can see the job coming down the final straight.
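As a rough picture of that kind of incremental bookkeeping, here is a small Python sketch. The class, its methods, and the rule names are hypothetical and say nothing about the real Pegasus UI or its log format.

```python
class RuleProgress:
    """Track which rules have completed and which are still outstanding."""

    def __init__(self, rule_names):
        self.remaining = list(rule_names)
        self.violations = {}          # rule name -> violation count

    def complete(self, rule, violation_count):
        # Called once per rule, as soon as that rule finishes.
        self.remaining.remove(rule)
        self.violations[rule] = violation_count

    def summary(self):
        done = len(self.violations)
        total = done + len(self.remaining)
        left = ", ".join(self.remaining) or "none"
        return f"{done}/{total} rules done, remaining: {left}"

progress = RuleProgress(["M1.S.1", "M1.W.1", "V1.EN.1"])  # made-up rule names
progress.complete("M1.W.1", violation_count=3)
print(progress.summary())
```

Because results are recorded rule by rule, violations for a finished rule can be inspected while the others are still running.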

Pegasus Under the Hood

Digo Goswami took over with some details of how Pegasus has been architected.

Underlying Principle: Massively Parallel, Fully Distributed
Pegasus can achieve near linear scalability up to thousands of CPUs
Scalability is achieved through
- Operational Level Parallelism: Divide each “command” into a # of small “operations” and run them concurrently
- Database Level Parallelism: Cell Partitioning and Clustering. Each command is instantiated multiple times to run on each cluster
- Pipelining: Pipelined engines add extra scalability and reduce memory footprint
“Connect” operation is fully distributed
Data Flow Architecture: No central database
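
The database-level parallelism in the list above, where the same command is instantiated once per cluster, can be illustrated with a toy minimum-width check in Python. Everything here is an assumption made for illustration: the rectangle format (x1, y1, x2, y2), the function names, and the use of a thread pool, which in CPython shows the structure of the idea rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def min_width_violations(rects, min_width):
    """One instance of the command: flag rectangles narrower than min_width."""
    return [r for r in rects if min(r[2] - r[0], r[3] - r[1]) < min_width]

def run_command_on_clusters(clusters, min_width, max_workers=4):
    """Run the same check on every cluster concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_cluster = pool.map(
            lambda cluster: min_width_violations(cluster, min_width), clusters
        )
        # pool.map yields results in cluster order, so the merge is deterministic.
        return [v for cluster_result in per_cluster for v in cluster_result]

clusters = [
    [(0, 0, 10, 1), (0, 0, 5, 5)],   # cluster 1 (hypothetical shapes)
    [(0, 0, 2, 2)],                  # cluster 2
]
violations = run_command_on_clusters(clusters, min_width=3)
```

A check like this needs no information from neighboring clusters, which is what makes it easy to distribute; connectivity is the harder case.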

Digo went on to talk about SmartVerify. In the early stage of the design cycle, the layout may be extremely dirty (lots of DRC errors) and running DRC can be prohibitively expensive. SmartVerify can quickly identify gross issues. It also has an understanding of the common errors produced by place & route tools and pinpoints the root cause. It is very fast, under an hour on full chips. For example, on TI's chip: 8 CPUs, 36GB RAM, in 42 minutes. Using SmartVerify to get the design clean, before using full-blown DRC can dramatically reduce the overall cycle time, as in the image below:

Earlier, in his part of the presentation, Kyle had said that using SmartVerify, TI quickly found a couple of errors that could have been hard to find:

stray geometry outside bounding box from a new IP handoff
GDSII collisions from a new IP handoff

One time-consuming aspect of LVS and antenna rule checks is that the connectivity of the entire design needs to be analyzed. This is inherently unfriendly to distribution since connect operations are interdependent and the same layer appears in connect commands for multiple layers. Even for regular DRC, 20% of rules in a modern deck require connect (for instance, connected metal might have a different spacing rule from unconnected metal). But 100% of antenna and LVS extraction involves connect since they require full connectivity by definition. Pegasus now has a distributed connect operation that scales across multiple workers, which is then heavily leveraged for LVS and antenna checking, as in the performance graphs below:
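
Digo did not go into how the distributed connect works internally, so the following Python sketch shows just one generic way to split a connectivity problem: each partition builds its nets locally with a union-find structure, and a final step stitches partitions together using the shapes that touch across partition boundaries. All names and data are hypothetical.

```python
class UnionFind:
    """Minimal union-find over hashable shape identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def local_connect(shapes, touching_pairs):
    """Connectivity inside one partition; can run independently per worker."""
    uf = UnionFind()
    for shape in shapes:
        uf.find(shape)
    for a, b in touching_pairs:
        uf.union(a, b)
    return uf

def merge_partitions(local_results, boundary_pairs):
    """Combine per-partition nets, then join nets that touch across boundaries."""
    merged = UnionFind()
    for uf in local_results:
        for shape in list(uf.parent):
            merged.union(uf.find(shape), shape)
    for a, b in boundary_pairs:
        merged.union(a, b)
    return merged

def nets(uf):
    """Return the nets as sorted lists of shape names."""
    groups = {}
    for shape in list(uf.parent):
        groups.setdefault(uf.find(shape), set()).add(shape)
    return sorted(sorted(group) for group in groups.values())

part_a = local_connect(["m1_a", "via_a"], [("m1_a", "via_a")])
part_b = local_connect(["m2_b", "m2_c"], [])
design = merge_partitions([part_a, part_b], boundary_pairs=[("via_a", "m2_b")])
```

The interdependence the paragraph describes shows up in the merge step: a net is not known to be complete until every partition it crosses has reported in.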

Summary

TI successfully taped out a 16nm design using the TSMC Qualified Pegasus Runset. Full chip DRC ran in 10.5 hours on 384 CPUs, which exceeded the goal of simply "overnight." Pegasus used the resources in their existing farm without any need for customization or reserved hosts. The usability features make it easy to monitor progress and assess individual results.
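
For a sense of scale, the headline numbers can be turned into CPU-hours with a few lines of Python. The second function is a back-of-the-envelope estimate that assumes perfectly linear scaling, which is an idealization and not something TI reported; the function names are made up.

```python
import math

def cpu_hours(wall_hours, cpus):
    """Total compute consumed by a run."""
    return wall_hours * cpus

def cpus_for_target(wall_hours, cpus, target_hours):
    """CPUs needed to hit target_hours, ASSUMING ideal linear scaling."""
    return math.ceil(cpu_hours(wall_hours, cpus) / target_hours)

total = cpu_hours(10.5, 384)                 # the reported full-chip DRC run
for_12h = cpus_for_target(10.5, 384, 12)     # 12-hour overnight window
for_15h = cpus_for_target(10.5, 384, 15)     # 15-hour overnight window
```

The reported run works out to 4,032 CPU-hours. Under the linear-scaling assumption, the 12-15 hour overnight window Kyle mentioned would need roughly 336 down to 269 CPUs.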

The presentations should be available on the CDNLive Silicon Valley page within a few weeks. For more information on Pegasus, see the Pegasus Verification page.
