Scaling EDA in the Cloud

28 Mar 2019 • 4 minute read

Last year at DAC, we announced Cadence Cloud (for details see my post cleverly titled Cadence Cloud). Of course, one aspect of the cloud is that it allows you to have as much of everything as you need—if you want 100 SystemVerilog simulations or to do library characterization at dozens of corners, you can bring a lot of compute-power to bear fairly simply. But the real promise of the cloud is to bring a lot of compute-power to bear on a single big task. Writing EDA tools for this environment is not straightforward. In particular, you can't usually just take the code written for a single workstation and immediately have it scale up to lots of servers. There are a number of reasons for this.

Some tasks can be scaled fairly easily. For example, consider design rule checking (DRC). There are a number of obvious ways to use lots of servers. One is to check different rules on different servers, since many rules (is there a metal0 spacing violation?) are independent of others (is there a metal1 spacing violation?). Another is to divide the chip up into different tiles and check them independently. This requires a lot of care when handling the edges of the tiles where they overlap, but the fact that design rules are inherently local means that the overlap doesn't need to be all that large. Circuit extraction is similar: we worry about the capacitance between a conductor and other conductors in the vicinity, but not about a conductor halfway across the chip.
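To make the tiling idea concrete, here is a minimal Python sketch of splitting a chip into tiles that overlap their neighbors. It is an illustration only, not how any particular DRC tool does it: the function name `tiles_with_halo`, the `halo` parameter (the overlap width), and the rectangle representation are all assumptions made up for this example. Each tile is checked over its padded region but is only responsible for its core region, so every location has exactly one owner.

```python
from typing import Iterator, Tuple

# Hypothetical rectangle type: (x_min, y_min, x_max, y_max) in layout units.
Rect = Tuple[int, int, int, int]


def tiles_with_halo(width: int, height: int, tile: int, halo: int) -> Iterator[Tuple[Rect, Rect]]:
    """Yield (core, padded) rectangles covering a width x height chip.

    core   : the region this tile owns
    padded : the core grown by `halo` on every side, clipped to the chip,
             so that geometry just across the tile edge is still visible
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            core = (x, y, min(x + tile, width), min(y + tile, height))
            padded = (
                max(core[0] - halo, 0),
                max(core[1] - halo, 0),
                min(core[2] + halo, width),
                min(core[3] + halo, height),
            )
            yield core, padded


# Example: a 100 x 100 chip, 50 x 50 tiles, overlap of 5 units.
tiles = list(tiles_with_halo(width=100, height=100, tile=50, halo=5))
```

Because design rules are local, the assumption here is that `halo` only needs to be as large as the longest distance any rule looks across, which is why it can stay small relative to the tile size.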

Causality

Other EDA tasks are more difficult to scale to the cloud because they have something global that means that all the "tiles" cannot be completely independent. For example, in simulation, the time (or the clock) has to be agreed among all tasks. There is also less inherent locality than in a DRC, since signals really do get transmitted from one side of the chip to the other.

When I worked at VaST and Virtutech, we would simulate in parallel very large systems connected by networks like Ethernet, with each node being simulated on a different server. We had some flexibility since Ethernet is not cycle-accurate at the level of megahertz clocks, and this is what made the approach work. When one part of the simulation is running ahead of another and then sends a message, the message appears to the receiver as coming either from the future or from the past. Coming from the future is not a problem, since the message can simply be buffered until time catches up and then delivered. But a message from the past can be a problem since, in the worst case, it means that the simulation already done should never have happened (imagine a reset signal being sent, for one obvious case). In the case of simulation of a system connected by Ethernet, it is possible to just ignore the fact that the message is from the past and process it at the current time. The error is just additional network delay, which the system needs to be robust to anyway. The same approach can be taken at the SoC level with multiple clock domains that are only loosely synchronized.
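The buffering scheme described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the VaST or Virtutech implementation: the class name `Node`, its methods, and the integer timestamps are all hypothetical. Messages stamped later than the node's local time are queued until time catches up; messages stamped earlier are simply delivered at the current local time, which shows up as extra network delay.

```python
import heapq


class Node:
    """One simulated node with its own local clock (hypothetical API)."""

    def __init__(self, deliver):
        self.now = 0
        self._pending = []   # min-heap of (timestamp, sequence, payload)
        self._seq = 0        # tie-breaker so payloads are never compared
        self._deliver = deliver

    def receive(self, timestamp, payload):
        if timestamp > self.now:
            # From the future: hold it until local time reaches the timestamp.
            heapq.heappush(self._pending, (timestamp, self._seq, payload))
            self._seq += 1
        else:
            # From the past (or exactly now): process it at the current time.
            self._deliver(self.now, payload)

    def advance_to(self, t):
        # Deliver every buffered message whose time has now arrived, in order.
        while self._pending and self._pending[0][0] <= t:
            timestamp, _, payload = heapq.heappop(self._pending)
            self.now = timestamp
            self._deliver(timestamp, payload)
        self.now = t


log = []
node = Node(lambda t, payload: log.append((t, payload)))
node.advance_to(10)
node.receive(15, "future")   # buffered
node.receive(7, "late")      # delivered immediately, at local time 10
node.advance_to(20)          # releases the buffered message at time 15
```

Note that this only works where late delivery is tolerable, as with Ethernet traffic. It deliberately does nothing to protect causality.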

But ignoring it doesn't work when simulating a big digital system (or a big block) with a single clock, and dividing it up into sub-simulations. A signal coming from the past can violate causality. This is a term stolen from special relativity where a similar problem can arise with observers and signals limited by the speed of light. It is similar to the problems in stories involving time travel, where events in the past can mix things up: what if you go back into the past and then cause an event such that you were never born in the first place? The Back to the Future movies did a good job of playing with this.

There are other techniques that can be used to minimize causality issues. For example, in mixed-signal simulation, it is not efficient to run the circuit simulator for every tiny time increment of the digital simulation. A better approach is to speculatively simulate a longer time interval, and on the rare occasions when this turns out to be a mistake—for example, the digital simulation changes one of the inputs to the analog block—roll back the simulation and redo it. This works well if rollback happens rarely. It is similar, in some ways, to how an incorrect branch prediction is handled in a modern microprocessor.
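As a rough sketch of speculate-and-roll-back, the Python below checkpoints a toy analog block, runs it optimistically to the end of a time window, and redoes the work only if an input turns out to have changed partway through. Everything here is an assumption for illustration: `AnalogBlock` is a stand-in that just integrates its input, and `speculate` and its parameters are hypothetical names, not any simulator's API.

```python
import copy


class AnalogBlock:
    """Toy stand-in for a circuit simulator: integrates its input over time."""

    def __init__(self):
        self.time = 0
        self.value = 0.0

    def run(self, until, input_level):
        self.value += input_level * (until - self.time)
        self.time = until


def speculate(block, window_end, input_level, input_change_at=None, new_level=None):
    """Run `block` to `window_end`, rolling back if the input changed early.

    Returns (block, rolled_back).
    """
    checkpoint = copy.deepcopy(block)          # saved state for a possible rollback
    block.run(window_end, input_level)         # optimistic: assume the input is stable
    if input_change_at is None or input_change_at >= window_end:
        return block, False                    # speculation was correct
    # Speculation was wrong: restore the checkpoint and redo the window
    # in two pieces, before and after the input change.
    block = checkpoint
    block.run(input_change_at, input_level)
    block.run(window_end, new_level)
    return block, True


stable, stable_rolled_back = speculate(AnalogBlock(), window_end=10, input_level=1.0)
changed, changed_rolled_back = speculate(
    AnalogBlock(), window_end=10, input_level=1.0, input_change_at=4, new_level=0.0
)
```

The cost of a wrong guess is the wasted optimistic run plus the checkpoint, so the scheme pays off only when the second path is taken rarely.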

Amdahl's Law

A more fundamental limit was noted by Gene Amdahl when he was designing mainframes with huge vector processing units that could perform a lot of work in parallel, such as unrolling a loop and running multiple iterations simultaneously. Gene noted that no matter how fast and how much capacity the vector units had, there was some code that could not be parallelized. His takeaway was that sequential performance in his computers was also really important.

His observation is now known as Amdahl's Law. There's a formula associated with it, but in the context of the cloud, we can see that if there is some percentage, say 10%, of the computation that cannot be scaled out to lots of processors, then this remaining percentage is the ultimate limit on performance. If the other 90% of the computation runs on so many cloud servers that it takes negligible time compared with the remaining 10%, then the maximum speedup possible is 10X, because you still have to run that 10%, which takes 10% of the original time.
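For reference, the usual statement of Amdahl's Law is that with a serial fraction s and N processors, the speedup is 1 / (s + (1 - s) / N). The short Python sketch below evaluates it for the 10% example; the function name `amdahl_speedup` and its parameter names are just chosen for this illustration.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law.

    serial_fraction : share of the original runtime that cannot be parallelized
    workers         : number of processors the parallel part is spread across
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 10% serial work, adding servers approaches but never reaches 10X.
speedups = {n: amdahl_speedup(0.10, n) for n in (1, 10, 100, 1_000_000)}
```

With 10 servers the formula gives a speedup of only about 5.3X, and even a million servers stay just under 10X, which is the ceiling set by the serial 10%.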

Building for the Cloud

In practice, these two limitations, global values that have to be shared and Amdahl's Law, mean that tools that truly scale into the cloud have to be built for massive parallelism from the start, not re-engineered after the fact.
