Yes, despite the blue skies, Fridays are still cloudy. And, yes, this isn't Friday. Due to a migraine out of the blue, Friday got postponed to Monday.
When I was at CDNLive Japan, Craig Johnson ran a demonstration of using Orchestrator to launch characterization of a few standard cells with Liberate on 40 cores in an AWS cloud datacenter somewhere in Japan. He also gave us a sneak preview of using Virtuoso interactively in the cloud to edit some schematics. The other tool that Orchestrator already supports is the Xcelium simulator. This is also supported fully with the Cloud Passport (where you manage the whole design environment yourself) and the Cloud-Hosted Design Solution (where Cadence manages it for you). Closely related to simulation is emulation, and that is provided by Palladium Cloud.
To see more about Craig's CDNLive Japan presentation, see my post CDNLive Japan: 対応ポートフォリオ (Supported Portfolio). For more on Palladium Cloud, see my post, inventively titled, err... Palladium Cloud. All Friday Breakfast Bytes posts going back to early June have been about Cadence Cloud (sneakily, even the ones before we announced it).
There are a few different aspects to using the cloud for verification, depending on the workload you want to run.
The first is that you have an almost unbounded number of cores. This is great if you are doing something like characterizing a library, that today might involve 40,000 mostly short simulations. All those cells. All those corners. You probably don't want to go as far as spinning up 40,000 cores to get it all done almost instantly because of the overhead of starting and stopping them. But a thousand might be good.
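The trade-off between per-core startup overhead and parallelism can be sketched with some back-of-the-envelope arithmetic. The numbers here (one minute per simulation, a couple of minutes to spin the fleet up) are purely illustrative assumptions, not measurements:

```python
# Rough model: spin up the fleet once, then run the jobs in
# waves of `cores` at a time. All numbers are illustrative.

def elapsed_minutes(jobs, cores, job_minutes=1.0, spinup_minutes=2.0):
    """Approximate wall-clock time for a characterization campaign."""
    waves = -(-jobs // cores)  # ceiling division: batches of simulations
    return spinup_minutes + waves * job_minutes

for cores in (40, 1000, 40000):
    print(f"{cores:>6} cores: {elapsed_minutes(40000, cores):.0f} minutes")
```

Even with this crude model, 1,000 cores captures almost all of the benefit: going from 40 cores to 1,000 takes the campaign from roughly 1,000 minutes down to about 40, and in practice the spin-up cost for a 40,000-core fleet would be far larger than modeled here, which is exactly why you stop well short of one core per job.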
A second issue is with the simulations at the other end of the scale. The ones that run for a loooooong time. Because of the shared timebase, simulation of a big design is one of the things that is hard to parallelize without the synchronization of time killing you, whereas library characterization, with thousands of independent jobs, is obviously straightforward. Xcelium, however, with the Rocketick technology we acquired a year or two ago, can make use of servers with lots of cores and memory, and in the cloud you don't have to wait to get one, as is typically the case for on-prem datacenters. It needs those cores to be on the same server, so a top-of-the-line server. Of course, even in the cloud, this is a costly server, but you only pay for it when you need it, whereas in your own datacenter you pay for it even when you are just running library characterization jobs or something else that doesn't need that deluxe server.
So, whether you want to run a humongous number of small jobs or a few big ones, the cloud can provide the compute fabric on-demand. However, it is not quite that simple. In both cases, you need to keep track of all the jobs: which ones ran successfully, for a start. But at a finer level of granularity, you need to know what code was covered. Today, verification is not synonymous with simulation, and so it is important to avoid using simulation to check something that was already covered by formal verification or emulation. It is especially important to avoid trying to simulate some code that formal verification has already proved is unreachable. Everything needs to be pulled together.
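The bookkeeping that this kind of flow needs can be sketched in a few lines. This is a toy illustration, not a real vManager API; the job names and coverage points are invented, and the skip rule (don't launch a job whose coverage points are all already covered, for example by formal) is a deliberate simplification:

```python
# Toy sketch of coverage-aware job launching. Names are invented
# for illustration; this is not a real vManager interface.

def run_campaign(jobs, already_covered):
    """jobs: list of (name, covers) pairs, where covers is the set
    of coverage points the job would exercise. already_covered:
    points already handled, e.g. proven unreachable by formal."""
    covered = set(already_covered)
    launched, skipped = [], []
    for name, covers in jobs:
        if covers <= covered:      # adds nothing new: don't simulate
            skipped.append(name)
        else:
            launched.append(name)
            covered |= covers      # assume the job runs and hits them
    return launched, skipped, covered

jobs = [("smoke", {"reset", "boot"}),
        ("boot_again", {"boot"}),        # subsumed by "smoke"
        ("dma", {"dma_rd", "dma_wr"})]
launched, skipped, covered = run_campaign(jobs, {"unreachable_fsm"})
```

Here "boot_again" never gets launched, because by the time it is considered, everything it would cover has already been hit, which is the cloud-scale version of not wasting a server on a redundant simulation.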
That is where vManager comes in. It can either launch jobs directly and manage the cohort of servers in the cloud, or it can interface to job schedulers like LSF. The advantage of the second approach is that LSF may well be used to balance not just the jobs on your project, but also other projects and even non-EDA workloads that share the same compute resources.
Liberate Trio goes to another level, primarily for analog designs. It works out which of the characterization jobs should be run, and runs them, skipping the ones that would add no new information to the database.
vManager also tracks what goes on in emulation on Palladium and Protium, so although that isn't the focus of today, Palladium Cloud can run some of the jobs, typically the ones that involve a huge number of cycles due to having to boot up, say, Android, and then run a software benchmark.
Simulation involves more than just actually running the simulation. The RTL needs to be compiled. Waveforms need to be saved. Problems have to be debugged. All of these are also cloud-ready with Xcelium, able to take advantage of multiple cores.
Saving and restoring the simulation quickly is a very important capability. It allows the early part of the simulation, such as initialization or OS boot, to be run once and then saved. This savepoint can then be used in a couple of different ways. Either a difficult problem can be examined repeatedly to debug an issue, without needing to re-run the initialization. Or multiple runs can be launched from the savepoint without the need for each run to re-execute the initialization or boot sequence. The above diagram shows this more clearly.
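The arithmetic behind save/restore is simple but worth spelling out. The figures below (eight hours of simulated OS boot, one hour per test) are illustrative assumptions, not benchmarks of any particular tool:

```python
# Illustrative-only numbers: 8 hours to boot the OS in simulation,
# 1 hour of actual test on top of the boot.

def without_savepoint(tests, boot_hours=8, test_hours=1):
    return tests * (boot_hours + test_hours)   # every run re-boots

def with_savepoint(tests, boot_hours=8, test_hours=1):
    return boot_hours + tests * test_hours     # boot once, save, fan out

print(without_savepoint(20))  # 180 hours of compute
print(with_savepoint(20))     # 28 hours of compute
```

For a 20-test campaign under these assumptions, save/restore cuts the total compute from 180 hours to 28, and since the post-savepoint runs are independent, they can all fan out across cloud cores in parallel.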
The same goes for Palladium Cloud, where the actual emulation is done by the Palladium hardware. But it is still necessary to compile the RTL, access signals and waveforms. The attractions of high-speed emulation are a lot less if everything else involved in doing the verification is not also fast.
I think the advantages, except perhaps the economic ones, are obvious. If you can have as many machines as you want whenever you want, that is better than the alternative. Way better than the really old days when I started my career and all you had as an engineer was the machine at your desk (dorm-room-refrigerator size if you go back far enough). For a small company, the cloud makes it unnecessary to ramp up large, expensive datacenters and IT organizations. For a larger company that already has on-premises datacenters, the cloud can handle peak loads in the short term, and perhaps provide a more attractive solution for future growth compared to expanding those datacenters.
I wrote a post recently about how fixed costs are always wrong: too much resource with no revenue to pay for it, or business you can't take because you are out of capacity. See my post Turning Fixed Costs into Variable Costs: Foundries and Clouds.
The above diagram captures the big advantage of cloud-based verification. By having additional capacity available on-demand, the time required to run the same amount of verification is shortened, and so the tapeout date can be pulled in (or, perhaps more realistically, not risk slipping).
Sign up for Sunday Brunch, the weekly Breakfast Bytes email.