
Vinod Khera

Optimizing CPU Time, TAT, and Disk Space using Cadence Xcelium Advanced Technologies for DFT Verification

8 Feb 2022 • 8 minute read

Design for Testability (DFT) simulation is crucial to the SoC design process: rapid turnaround time (TAT) allows faster pattern verification for each netlist release and more iteration cycles during development, while efficient disk space and compute farm utilization enable more patterns to be simulated in each iteration cycle. At CadenceLIVE Americas 2021, Rajith Radhakrishnan, a DFT engineer at Texas Instruments (TI), talked about the DFT challenges associated with simulation and how Cadence EDA tools not only helped the team overcome those challenges but also delivered up to a 38% reduction in turnaround time from netlist release and up to a 96% reduction in the disk space required to store simulation builds, all leading to 4,289 days (about 11 and a half years) of CPU time saved.

Introduction

Executing DFT simulation with a quick turnaround time is crucial to the system-on-chip (SoC) design process. The design teams involved from specification to tapeout must go through multiple netlist iterations, and every netlist must be simulated to ensure a smoother silicon bring-up. To address these challenges, additional Xcelium features, namely Multi-Snapshot Incremental Elaboration (MSIE) and simulation save/restore, were incorporated into the DFT simulation flow.

Challenges

Zero-defect quality, rapid TAT, and more iteration cycles are essential, especially for automotive designs. Maintaining DFT simulation performance is challenging: larger designs mean more test patterns, higher test coverage targets, and multiple fault models (stuck-at 0/1, transition, bridge, IDDQ, RAM sequential, etc.) required to meet automotive quality standards (99% stuck-at coverage). To address these challenges, TI incorporated MSIE and simulation save/restore into the DFT simulation flow of recent automotive designs, optimizing performance. To demonstrate the effectiveness of the various optimizations, two SoCs (SoC A and SoC B) with 5.7M and 19.7M flops and 980 and 1,859 patterns, respectively, are considered here.

Original Flow

The original flow is shown below in Figure 1. The gate-level netlist is run through the ATPG (Automatic Test Pattern Generation) tool, Modus, to generate the test description language (TDL) patterns required to generate the testbench using SIMOUT. The testbench, SDF (Standard Delay Format), and netlist together feed the simulation build, where Xcelium generates the simulation snapshot used to simulate steps such as test mode entry, scan flush, and ATPG patterns.

Overall, the average turnaround time for SoC A was 34 hours. In SoC A, the steps in the flow run sequentially: the simulation build cannot start until Modus has generated the patterns, and the simulation run cannot begin until the simulation snapshot is complete. The number of snapshots required to run the simulation is another constraint in the original flow; we will highlight this issue using SoC B.

Any TDL with a different signal list requires a unique testbench (build snapshot). In this context, consider SoC B, which is more complex and needs more build snapshots (any change to the testbench requires an entirely new snapshot). For SoC B, there are 250+ testbenches, which equates to 250+ simulation snapshots. As shown in Figure 2, it takes ~40 hours (about a day and a half) and ~500 GB of disk space to build each snapshot, and the 292+ runs required to perfect the design added up to nearly 500 days of CPU time and 17 TB of total disk space. More details can be found in Figure 2.
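The CPU-time figure can be sanity-checked with quick arithmetic, using only the numbers quoted above:

```python
# Rough CPU cost of the original SoC B flow, using the figures quoted in the post.
runs = 292            # snapshot build runs needed across the project
hours_per_build = 40  # ~40 hours per snapshot build

total_cpu_days = runs * hours_per_build / 24
print(f"{total_cpu_days:.0f} CPU days")  # ~487 days, i.e. "nearly 500 days"
```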


Build Optimization using Xcelium Parallel and Incremental Build

The parallel and incremental build capability in Cadence Xcelium allows the core and the testbenches to be built separately in multiple snapshots. This breaks the simulation build into two parts: a primary build and an incremental (elaboration) build. The core build (netlist + SDF) is saved as a primary snapshot, and the testbench is later attached and incrementally elaborated (yellow block in Figure 3(a)). This approach eliminates the dependency on the testbench, as shown in Figure 3(b): the simulation build step and Modus ATPG test generation can now run in parallel, as shown below in Figure 3.

With the build process decoupled in this way, the primary build and Modus can run in parallel, which helped TI reduce the number of builds saved to disk to only 4 primary builds (3 corners + 1 power), compared to ~250 for SoC B, and from 127 down to 4 for SoC A.
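The TAT benefit of decoupling the build from the testbench can be illustrated with a toy critical-path model; the stage durations below are hypothetical placeholders, not TI's numbers:

```python
# Toy model: sequential vs. parallel flow turnaround time.
# Stage durations in hours are illustrative placeholders only.
modus_patterns = 10   # ATPG pattern generation (hypothetical)
sim_build = 12        # simulation snapshot build (hypothetical)
sim_run = 14          # pattern simulation (hypothetical)

# Original flow: every stage waits for the previous one.
sequential_tat = modus_patterns + sim_build + sim_run

# Optimized flow: the primary build no longer depends on the testbench,
# so the core build and Modus pattern generation overlap.
parallel_tat = max(modus_patterns, sim_build) + sim_run

print(sequential_tat, parallel_tat)  # 36 26
```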

Build Optimization: Hierarchical Reference File (HREF)

Comparing SoC A data between the original flow and the parallel and incremental build shows that memory usage more than doubled, often exceeding the memory available on the LSF machine (750 GB). The hierarchical reference file (HREF), used during compile to expose instances for incremental elaboration, is the main reason behind this spike in memory requirements. A natively available <reg> switch was used to expose all the flops (only the 'dff' primitive, not the pins used in the parallel testbenches). Since this did not resolve the issue, the <wire> switch was used to expose all the nets in the design, which further increased memory and disk space usage significantly. Figure 4 illustrates the exact issue the TI team faced while using the <reg> switch.

On the LHS (left-hand side) of the image, the register primitive (which sits inside the flop shown on the RHS (right-hand side)) exposes only the primitive, not the entire flop. The parallel testbench, which forces the scan_in pin and strobes the scan_out pin, did not have access to those pins, so the team had to resort to the <wire> switch. If a flop or a path were missed, the primary build would still work perfectly fine, but the incremental elaboration step would certainly fail because it could not access those ports.

What are the potential solutions to overcome this problem?

Listed below are some of the potential solutions that the team devised:

  1. Run Modus, generate patterns, and extract all the signals from the pattern file. However, this may increase the dependency on the testbench and will not help reduce the TAT.
  2. Obtain a list of flops from the PD (Physical Design) team and use it to generate the HREF file.

This was an issue since the path syntax between the tools used by the PD and Xcelium teams did not match. In addition, with every netlist release, this file had to be updated.

  3. Use Modus to generate the flop list. This requires the completion of the Modus build_model step, which can differentiate between buffers and flops. The team wrote a simple script that grabbed all the flops from the design and used the Modus get_openaccess_name command to convert the names so the Modus and Xcelium syntaxes matched.
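The name-conversion step can be sketched roughly as follows. The slash-to-dot path translation, file contents, and flop paths are illustrative assumptions; in the real flow, the Modus get_openaccess_name command performs this mapping:

```python
def modus_to_xcelium_path(modus_path: str) -> str:
    """Illustrative stand-in for the Modus get_openaccess_name mapping:
    convert a slash-separated hierarchical path (assumed Modus/PD style)
    into the dot-separated form expected in an Xcelium HREF file."""
    return modus_path.lstrip("/").replace("/", ".")

# Hypothetical flop paths, as a Modus flop report might list them.
flops = ["/top/u_core/u_scan/q_reg", "/top/u_io/sync_ff"]
href_lines = [modus_to_xcelium_path(p) for p in flops]
print(href_lines)  # ['top.u_core.u_scan.q_reg', 'top.u_io.sync_ff']
```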

This approach helped the designers achieve the benefits of parallel and incremental build and complete the build snapshot process before Modus even finished.

  4. Recent Xcelium versions (20.06.001+) introduce module-based HREF, which allows you to provide the flop library cell name instead of a list of exact flop paths. The tool looks for all instances of the flop and exposes them. This approach enabled the team not only to generate an HREF file at the start of the design but also to reuse the same file throughout the design process, since the library cells do not (ideally) change.
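Conceptually, module-based HREF amounts to expanding one library cell name into every instance of that cell. A toy version of that lookup, over a hypothetical instance database with assumed cell names, might look like:

```python
# Toy instance database: instance path -> library cell name (all hypothetical).
instances = {
    "top.u_core.q_reg": "SDFFX1",   # scan flop cell (assumed name)
    "top.u_io.sync_ff": "SDFFX1",
    "top.u_core.u_buf": "BUFX2",
}

def expand_module_href(cell_name: str) -> list[str]:
    """Return every instance of a given library cell, the way a single
    module-based HREF entry lets the tool expose all flops at once."""
    return sorted(p for p, cell in instances.items() if cell == cell_name)

print(expand_module_href("SDFFX1"))  # ['top.u_core.q_reg', 'top.u_io.sync_ff']
```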

This fix resulted in memory and runtime improvements: memory and disk space requirements were reduced by 50% in both the primary and incremental build steps, without compromising the average run time (SoC A). Simulation time was also reduced from 26 hours to 17 hours, and memory requirements dropped by a further 45%.

  • The total turnaround time for SoC A significantly improved from 34 hours to 27.5 hours on average.

Simulation Optimization using SAVE/Restore

The DFT TDL consists of three parts:

  • Test mode entry (TME)
  • Scan flush
  • Core patterns

For stuck-at simulation, roughly 15 minutes are spent opening the snapshot, four hours on test mode entry, 30 minutes on scan flush, and five hours on simulating the core pattern file. For transition fault testing (TFT), the run time to simulate the core pattern file increases to 13 hours due to the PLLs pulsing and the higher switching activity.

The save/restore capability in Xcelium enables users to save a snapshot of all processes at a given point in the simulation. The Test Mode Entry (TME), common to all partitions, can be run once and is an ideal candidate for the save/restore feature. The snapshot can be saved and reused for subsequent simulations, significantly reducing debug iteration time since the TME time is skipped, saving ~4 hours per stuck-at simulation and ~13 hours per TFT restore. Running the optimization sequence for SoC A with the original approach versus save/restore clearly shows a significant reduction in turnaround time (11.5 hours) and enables faster re-simulation with extra probe/waveform dumps to assist in debugging, especially before tapeout.
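The per-run saving from skipping TME can be checked against the stuck-at profile quoted above (the 30-minute step is assumed to be the scan flush from the TDL breakdown):

```python
# Stuck-at run profile from the post, in hours.
open_snapshot = 0.25   # ~15 minutes to open the snapshot
tme = 4.0              # test mode entry
scan_flush = 0.5       # ~30 minutes (assumed to be the scan-flush step)
core_patterns = 5.0    # core pattern simulation

full_run = open_snapshot + tme + scan_flush + core_patterns
restored_run = full_run - tme   # save/restore skips TME on every rerun
print(full_run, restored_run)   # 9.75 5.75
```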

Conclusion

DFT simulation execution with a quick TAT is crucial to the SoC design process. Advanced Xcelium capabilities such as parallel and incremental build and save/restore were deployed into the DFT simulation flow. Multi-run parallel and incremental build allows users to control elaboration partitioning for significant storage and runtime savings. More importantly, it allows tighter control over reuse and environment consistency.

The results from the two recent automotive designs achieved up to a 38% and 35% reduction in TAT, up to a 90% and 96% reduction in the disk space required to store simulation builds, and a considerable reduction in CPU days (3,350 for SoC B).

© 2025 Cadence Design Systems, Inc. All Rights Reserved.
