Paul McLellan

TSMC, Microsoft, Cadence: Signoff in the Cloud

5 Nov 2020 • 7 minute read

As you can guess from the title of this post, TSMC, Cadence, and Microsoft have been working together on signoff in the cloud. Since signoff comes late in the design cycle and is CPU intensive, it is an ideal part of the flow to move to the cloud. It makes little sense to provision in-house servers for the peak needs of signoff, since those servers would sit idle much of the rest of the time, especially at companies that create only a few SoC designs per year.

There is a TSMC white paper on this topic, which was presented at CadenceLIVE Americas. And on December 8, there will be a webinar on the topic. Details and registration links are at the end of this post.

You might expect a blog post on signoff to discuss process corners, or voltage-based timing, or variation. But the topic of the white paper (and this post) is more the practicalities of using cloud scalability. Cloud is always pay-per-use, but there are smart ways to reduce the number of virtual machines (VMs) you pay for while still speeding up timing signoff substantially.

There is a tradeoff between the amount (and cost) of computing power and the run time. If you are extravagant, you can consume a lot of computing power for a minimal decrease in run time. If you are too parsimonious, your run time will increase dramatically. In between, there is a sweet spot. One of the challenges with the cloud is making sure that whatever machines you pick are fully utilized. A trivial example: if you use a VM with a huge amount of memory but the design is not that large, then you are going to waste memory (and the money you pay to have it available).

Working with Cadence and Microsoft, TSMC has come up with two particular strategies that they call "scale-out" and "scale-in".

Strategy 1: Scale-Out

This strategy is to use more cloud VMs to reduce the number of runs that end up being sequential due to not having enough servers to run them all in parallel. People tend to think that using a lot more VMs in the cloud will increase cost substantially, but that is not necessarily the case. In the diagram below, if the total runtime speedup ratio A is larger than the VM count increase ratio B, designers will enjoy both cost saving and run-time reduction at the same time.
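
To make that arithmetic concrete, here is a minimal sketch in Python (the function name and sample numbers are my own illustrative assumptions, not figures from the white paper). Cloud cost is roughly the number of VMs multiplied by the wall time, so the scaled-out run is cheaper whenever B/A is less than one:

```python
# Toy model of the scale-out tradeoff: total cost ~ (VM count) x (wall time).
# The sample numbers below are illustrative assumptions, not white-paper data.

def relative_cost(vm_increase_b: float, speedup_a: float) -> float:
    """Cost of the scaled-out run relative to the baseline.

    vm_increase_b: VM count increase ratio B (new VM count / old VM count)
    speedup_a:     total runtime speedup ratio A (old wall time / new wall time)
    """
    return vm_increase_b / speedup_a

# Example: 3x the VMs (B = 3) giving a 4x speedup (A = 4).
# Cost ratio = 3/4 = 0.75 -- faster *and* 25% cheaper.
print(relative_cost(vm_increase_b=3.0, speedup_a=4.0))  # 0.75
```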

Strategy 2: Scale-In

This strategy packs more jobs into one VM to reduce the total number of cloud VMs required. The full-blown scale-out strategy already discussed speeds up turnaround time to the extreme, but it may not be the most cost-effective. The scale-in strategy guides designers on how to balance run-time speedup against cost. In the diagram below, if the VM count decrease ratio C is larger than the run-time increase ratio D, designers get even further cost savings while still keeping turnaround time relatively short.

A key consideration when using the scale-in strategy is the peak memory footprint required by each EDA job. The total cost of cloud runs is determined by two factors: the number of VMs used and the run time. If you are overly aggressive and squeeze too many jobs into one VM, exceeding the total available memory, the EDA tools will resort to memory swapping and slow the run excessively. In that situation, you not only sacrifice the possible run-time speedup but also spend more money than needed, even though you are using fewer VMs. Peak memory per job therefore becomes a primary consideration that designers must pay attention to in order to strike the right balance when doing timing signoff in the cloud. A sketch of the packing arithmetic follows.
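
Here is a minimal sketch of both the C-versus-D comparison and the memory-packing constraint (the function names and per-job memory figure are assumptions for illustration; the 504GB VM size matches the high-grade machine described later in this post):

```python
# Toy model of the scale-in tradeoff: pack jobs into a VM without exceeding
# its memory, since swapping would destroy both the speedup and the savings.
# The sample numbers are illustrative assumptions, not white-paper data.
import math

def max_jobs_per_vm(vm_memory_gb: float, peak_job_memory_gb: float) -> int:
    """How many jobs fit in one VM, keyed on *peak* (not average) memory."""
    return math.floor(vm_memory_gb / peak_job_memory_gb)

def scale_in_saves_money(vm_decrease_c: float, runtime_increase_d: float) -> bool:
    """Scale-in wins when the VM count decrease C outpaces the slowdown D."""
    return vm_decrease_c > runtime_increase_d

# Example: jobs peaking at 120GB each on a 504GB VM -> 4 jobs per VM.
print(max_jobs_per_vm(504, 120))          # 4
# Example: 3x fewer VMs (C = 3) at 2x the wall time (D = 2) -> net saving.
print(scale_in_saves_money(3.0, 2.0))     # True
```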

Tempus and Quantus Results

To show how these two strategies work, and how to balance them, let's look at a sample design: a 37-million-instance design in TSMC N5 technology with 150 timing views and 12 extraction corners. The following are key highlights from the successful execution of the scale-out and scale-in strategies:

  1. Tempus Timing Signoff Solution – For maximum turnaround-time reduction (scale-out), we leveraged the vast resources of the cloud to run 150 machines in parallel, each using 32 CPUs. This massive scale-out approach established the best baseline runtime against which the other configurations were compared. It is not possible to go faster.
  2. Tempus Timing Signoff Solution – For maximum efficiency (scale-in), we utilized the full memory available on the cloud machines. This provided a 2X machine cost reduction (over the fully parallelized run in point 1) while requiring only 2X more wall time.
  3. Quantus Extraction Solution – For maximum efficiency (scale-in), we demonstrated nearly linear scalability through 64 CPUs, with continued performance improvement through 256 CPUs (scale-out).

It is also worth pointing out that there is sometimes no real advantage in paying extra to get a job to run faster if nobody is going to be able to use the results. For example, an overnight run might take 12 hours, with the output ready at the start of the following day. By paying extra, it might be possible to get the job to finish at 3am, but for practical purposes that is the same as overnight since the engineering team is asleep. Of course, some teams make use of timezones to effectively run 24 hours per day, in which case somebody will be up and about at 3am.

By moving to the cloud, users can experience a significant improvement in productivity, no longer constrained by on-premises hardware limitations. By strategically applying the scale-in and scale-out strategies, they can accelerate their schedules while maintaining a high return on investment in cloud hardware expense. The CloudBurst platform, built on top of and integrated with Azure, provided the connectivity model, user interface, job control, security model, and file transfer capabilities.

Some Details on the VMs

The sample design on CloudBurst was handled by two classes of machine:

  • "low grade" E48ds_v4, Intel Cascade Lake 8272, 48 vCores, 384GB memory
  • "high grade" E64ds_v4, Intel Cascade Lake 8272, 64 vCores, 504GB memory

The basic core is the same in both grades, with the same performance. The difference is in the number of cores and the amount of memory.

Tempus Solution

Using Tempus DMMMC (distributed multi-mode multi-corner), it was found that we could run three views in parallel on a single low-grade machine, or four views in parallel on a high-grade machine, limited by the memory footprint of each view. If all the views cannot run in parallel, the scheduler in the Tempus solution starts more views as machines become available until all the assigned views are completed, as the sketch below illustrates.
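
The scheduler inside the Tempus solution is, of course, proprietary; the following is only a toy sketch of the behavior just described (views queued until a memory-sized slot frees up), with all names and numbers assumed for illustration:

```python
# Toy sketch of the scheduling behavior described above: each machine runs
# as many views in parallel as its memory allows, and picks up the next
# queued view as soon as one finishes. Not the actual Tempus scheduler.
import heapq

def total_wall_time(view_runtimes, machines, views_per_machine):
    """Wall time when `machines` VMs each run `views_per_machine` views
    concurrently, starting queued views as slots become free."""
    slots = [0.0] * (machines * views_per_machine)  # time each slot frees up
    heapq.heapify(slots)
    for runtime in view_runtimes:
        start = heapq.heappop(slots)                # earliest-free slot
        heapq.heappush(slots, start + runtime)
    return max(slots)                               # last view to finish

# Example: 150 one-hour views on 10 high-grade machines at 4 views each
# -> 40 slots -> four sequential "waves" -> about 4 hours of wall time.
print(total_wall_time([1.0] * 150, machines=10, views_per_machine=4))  # 4.0
```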

The table below shows the same design being run on different combinations of machines. The runs circled in green use the scale-in strategy; those circled in red, scale-out; those with no color are in between. Note that "walltime ratio" compares each run's wall time against that of the fastest, most scaled-out run (the one with 150 E64 machines).

The conclusions, from the white paper, for the Tempus solution are:

  • Scale-Out Strategy
    • Recommend the scale-out strategy to gain maximum throughput and minimize turnaround time by taking full advantage of massive cloud capacity
    • E48 vs. E64 – pick the VM based on memory required by your design and its CPU count sweet spot
  • Scale-In Strategy
    • Recommend the scale-in strategy to parallelize multiple views (local parallelization) on a given machine to maximize available memory for peak efficiency
    • Then use global parallelism to scale turnaround time as schedule requires and budget permits
    • E48 vs. E64 – the main benefit of the E64 is the additional memory, allowing more parallel views per machine

Quantus Solution

The white paper goes through an analysis of how best to make use of the Quantus solution's multi-corner strategy, which operates very differently from the Tempus solution. I'll skip to the recommendations in the white paper.

  • Since the Quantus Extraction Solution's MC extraction operates at per-core granularity, the user can freely choose the number of cores as a function of desired wall time versus cost
  • If the fastest turnaround time is desired, the scale-out strategy using a high number of CPU cores is recommended
  • If a balanced approach of turnaround time versus cost is desired, the scale-in strategy is recommended (see the sketch below)
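
The white paper does not publish a scaling formula, so purely as an illustrative stand-in, here is a sketch using Amdahl's law with an assumed serial fraction (none of these numbers are Quantus measurements):

```python
# Illustrative core-count tradeoff using Amdahl's law as a stand-in scaling
# model. The serial fraction and timings are assumptions, not Quantus data.

def wall_time(cores: int, serial_fraction: float = 0.01, t1: float = 100.0) -> float:
    """Amdahl's-law wall time for a job taking t1 hours on one core."""
    return t1 * (serial_fraction + (1.0 - serial_fraction) / cores)

def core_hours(cores: int) -> float:
    """Cost proxy: cores reserved multiplied by wall time."""
    return cores * wall_time(cores)

for n in (16, 64, 256):
    print(f"{n:4d} cores: wall time {wall_time(n):6.2f}h, cost {core_hours(n):7.1f} core-hours")
# 256 cores (scale-out) minimizes wall time; fewer cores (scale-in)
# minimizes core-hours -- the balance the recommendations above describe.
```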

Learn More

The upcoming webinar is on December 8. Details are on the webinar page, including a link for registration.


Sign up for Sunday Brunch, the weekly Breakfast Bytes email.