The EDA industry is gearing up for what may be its largest retooling ever – retrofitting or rewriting applications to run on next-generation multicore platforms. An inside look at how Cadence ported the Encounter Digital Implementation System (EDI) to parallel processing illustrates some of the challenges, solutions and benefits.
In December 2008 Cadence introduced EDI 8.1, an IC implementation suite that offers parallel processing across the design flow. Tom Spyrou, distinguished engineer at Cadence, led the effort to parallelize the Encounter tools and is still working on that today.
First, the good news. Nearly all the tools within EDI are multi-threaded or “super-threaded.” (A super-threaded tool runs on multiple workstations, each of which may have multiple CPUs.) This includes RTL synthesis, placement, routing, timing analysis, signal-integrity analysis, metal fill, and design rule checking. Fine-grained partitioning is automatic, so the user doesn’t have to partition anything manually.
But not quite everything is multi-threaded. At this time, the Encounter R&D team is still working on parallelizing floorplanning and physical optimization. Much of this will show up in a release later this year. Optimization takes up about 50 percent of the flow, so it’s an important piece.
The bad news in all this is Amdahl’s Law, which imposes a kind of speed limit on parallel processing systems. It says that the overall speedup of an application is limited by the portions that aren’t parallelized. Thus, if 90 percent of a program’s runtime is parallelized but 10 percent is not, you get at most a 10X speedup, no matter how many CPUs you add.
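To make the 10X ceiling concrete, here is a minimal Python sketch of the textbook form of Amdahl's Law. The function names are mine, chosen for illustration, and the model is idealized: it ignores synchronization and communication overhead, so real speedups will be lower.

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Ideal speedup when `parallel_fraction` of the runtime is spread over `cpus`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full cost; only the parallel part shrinks with more CPUs.
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


def amdahl_limit(parallel_fraction: float) -> float:
    """Speedup ceiling as the CPU count grows without bound."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # 90 percent parallelized, as in the example above.
    for cpus in (4, 16, 32, 1024):
        print(f"{cpus:5d} CPUs: {amdahl_speedup(0.90, cpus):.2f}X")
    print(f"ceiling: {amdahl_limit(0.90):.1f}X")
```

With 90 percent of the runtime parallelized, 4 CPUs yield only about a 3X speedup in this model, and the curve flattens toward 10X as CPUs are added.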
EDI is a suite composed of many different tools. While individual tools run significantly faster when multi-threaded, the full flow from netlist to routed design was about 23 percent faster on 4 CPUs in the initial 8.1 release. Tom’s goal is a 2X speedup for the full flow on 4 CPUs by the end of the year. That’s about as good as it gets for full EDA flows right now, he said.
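As a rough back-of-the-envelope check, Amdahl's Law can be inverted to estimate what effective share of the runtime must scale to produce a given speedup. The sketch below is my own illustration, not a Cadence figure. It assumes the ideal form of the law with no parallel overhead, and it reads "23 percent faster" as a 1.23X speedup. If the phrase instead means a 23 percent reduction in runtime, the speedup would be about 1.30X and the implied fraction somewhat higher.

```python
def implied_parallel_fraction(speedup: float, cpus: int) -> float:
    """Solve Amdahl's Law for the parallel fraction p, given a speedup on `cpus` CPUs."""
    if cpus < 2:
        raise ValueError("need at least 2 CPUs to infer a parallel fraction")
    if not 1.0 <= speedup <= cpus:
        raise ValueError("speedup must lie between 1 and the CPU count")
    # From S = 1 / ((1 - p) + p / n), rearranged: p = (1 - 1/S) / (1 - 1/n)
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cpus)


if __name__ == "__main__":
    # Assumed reading of the 8.1 full-flow result: 1.23X on 4 CPUs.
    print(f"1.23X on 4 CPUs -> p = {implied_parallel_fraction(1.23, 4):.2f}")
    # The stated goal: 2X on 4 CPUs.
    print(f"2.00X on 4 CPUs -> p = {implied_parallel_fraction(2.0, 4):.2f}")
```

Under these assumptions, the initial release behaves as if roughly a quarter of the full-flow runtime scales across CPUs, while the 2X goal needs about two thirds.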
What does it take to port a large CAD application to multicore? At a panel I moderated in 2007, analyst Gary Smith said it might take three years (that panel also included Gene Amdahl, who remarked that he never intended to formulate a “law”). This may not be far off. Tom has been working on the Encounter suite since late 2006, although with a small team of people.
Fortunately, Tom said, EDI was more of a retrofit than a rewrite. But things will change when we get into the “manycore” realm of more than 16 to 32 CPUs. One problem is Amdahl’s Law – with that many CPUs, you’d better parallelize 95-98 percent of your application. Another is that each processor will have its own cache, forcing programmers to “micro-manage” caches and avoid bottlenecks between main memory and cache. Tom’s conclusion about manycore: “It’s a rewrite. There are no clever processing techniques that get you there.”
Some clever processing techniques were used for EDI, however. One of the programming challenges had to do with legacy non-thread-safe code. To cope with this, Tom’s team deployed “lightweight” (meaning low memory) multiple processes. This took some memory optimization work, but it works just as well as pure multi-threading, he said.
Debugging race conditions was a big challenge for Tom’s team. “There is so much going on at the same time that you have to program for debuggability,” he said. Fault tolerance was another issue – what if a machine goes down or hangs?
Porting to multicore “is an art more than a science,” Tom said. “The first step is a detailed understanding of how your legacy code works. Look at places where a lot of CPU time is taken, and focus on parallelizing that part. You want the most parallelization for the least perturbation of the code. It’s an iterative brainstorming process.”
My take: This work may be more important than we think. We have some of the most complex software in the world right here in the EDA industry. If we can make it run well on multicore and manycore platforms, that bodes well for a multicore future. If not – then we’ll have to ask who, if anyone, will actually be able to program these platforms.
In part two, we’ll look at how parallel processing was applied to a different set of challenges in the Cadence Virtuoso Accelerated Parallel Simulator.
I agree that what matters is the time spent in code, not the number of lines. If the post read otherwise, it was not intentional. The timing engine is definitely a key piece that we are focused on.
I also agree that over time the tough algorithmic pieces may begin to dominate the runtime more than they do for today's designs. Right now we are focused on driving the engineering team to an easily measurable goal: for a given design, how well does it scale with more CPUs?
It is important to remember that Amdahl’s law speaks to the percent of time spent in code, not the number of lines of code. So, if 90% of the time is spent in 100 lines of a 1-million-line program, then you only need to parallelize those 100 lines to get up to a 10X speedup. In the case of chip optimization (referenced in the blog), the “magic” spot is the timing engine; it should be both incremental and parallel.
It is also important to remember that the runtime of EDA tools is a function of the size of the design; and the size of the design is doubling every 2 years! So, a tool that spends 90% of its time in one block of code for a given design will spend 95% of its time in that code 2 years from now and 96.5% of its time in that code 4 years from now! When looked at from this point of view, Amdahl’s law isn’t such a limiting factor.
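The arithmetic behind this argument can be sketched in a few lines of Python. The comment does not spell out its growth model, so the code assumes the simplest one: the time spent in the hot block scales linearly with design size, and the time spent everywhere else stays constant. The function name is hypothetical.

```python
def hot_spot_share(initial_share: float, growth_factor: float) -> float:
    """Runtime share of the hot code after its cost grows by `growth_factor`.

    Assumes (for illustration only) that the rest of the runtime stays constant.
    """
    if not 0.0 <= initial_share <= 1.0:
        raise ValueError("initial_share must be between 0 and 1")
    if growth_factor <= 0.0:
        raise ValueError("growth_factor must be positive")
    hot = initial_share * growth_factor   # hot block grows with design size
    rest = 1.0 - initial_share            # everything else is held fixed
    return hot / (hot + rest)


if __name__ == "__main__":
    # Design size doubling every 2 years: 2X after 2 years, 4X after 4 years.
    for years, factor in ((0, 1.0), (2, 2.0), (4, 4.0)):
        print(f"year {years}: {hot_spot_share(0.90, factor):.1%} of runtime in the hot block")
```

In this model the hot block's share rises from 90% to roughly 95% after one doubling and about 97% after two, so the exact percentages depend on the growth model assumed. The direction matches the comment's point: as designs grow, a larger share of the runtime lands in code that can be parallelized.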