Get email delivery of the Cadence blog featured here
Presentations started with a
somewhat skeptical presentation by Patrick Groeneveld of Magma Design
Automation, and then moved on to a really, really skeptical presentation by
Patrick Madden of the State University of New York (SUNY) Binghamton. These
were followed by what I'd call a balanced, hands-on presentation from Tom
Spyrou of Cadence. (He has been working to parallelize the Encounter Digital
Implementation system, and I interviewed
him last year about the challenges that entails).
If there was a silent
presence in the room, it was Gene Amdahl, whose "Amdahl's Law" reminds us that
the gains you get from parallelizing code are sharply limited by any sections
that cannot be parallelized. As Patrick Groeneveld pointed out, if you can
parallelize 50 percent of the code, you can only get a maximum 2X theoretical
speedup. The reality is probably closer to 0.8X. Since all multi-tool EDA flows
include some non-parallelizable code, no one should expect a 7X speedup on 8 processors.
Patrick Groeneveld - The "Last Resort"
"Parallelism is the last
thing I look at if I have a problem in my flow," said Patrick, who serves as
Magma's chief technologist. "I will try everything else first because I know
parallelism has some problems." Sometimes, he noted, parallelizing code can
actually cause a slowdown because of the "hidden costs and extra overhead" of
making code parallel.
Patrick identified six
problems he's experienced in trying to parallelize EDA software:
Patrick Madden: Advocates are "Delusional"
Anyone who was a little
depressed from Patrick Groeneveld's presentation was not helped by Patrick
Madden, whose first words were that Groeneveld was "way too optimistic" about
parallel programming. I was not surprised. In a panel (with the real Gene
Amdahl) that I wrote about
a couple of years ago, Madden said that multicore programming is "the end of
the world as far as I'm concerned."
Madden, however, is working
on a problem that should be important to parallel programmers - speeding up the
"hard serial bottlenecks" that cannot be parallelized, and thus give rise to
the limitations of Amdahl's Law. And where do we find such bottlenecks? In
papers and books by parallel programming experts, he suggested. He told of one
Design Automation Conference (DAC) paper that presented a 100-fold speedup with
a parallelized algorithm. It later turned out that a time-worn serial algorithm
actually did a faster job of solving the same problem, with no special hardware
"I would think that when I
pick up a technical paper from a major university, things should be correct,"
Madden said. "Instead there are monumental flaws presenting a new technology
that's orders of magnitude slower than the existing solution." With a slide
including a photo of Bernie Madoff, he suggested that some of the published
parallel processing research is fraudulent.
Audience members reacted,
saying that Madden was "too skeptical," focusing on a small amount of bad
research, and not giving new programming methods a chance to catch up. "If I do
video, a lot of stuff is embarrassingly parallel and that's how it's done," one
said. Madden agreed, but went on to suggest that some parallel programming
advocates are "delusional" and are seeing what they wish to see after taking
grant money to do parallel computing research.
Tom Spyrou: Yes, We Can Parallelize Legacy Code
It may have been a hard act
to follow, but Tom Spyrou spoke next, in a calm presentation that showed how
legacy EDA code can be parallelized within the context of Amdahl's Law. Spyrou focused
on the "fine grained" problems that are most difficult to parallelize.
"For multicore, with 4 to 8
cores, there are shortcuts you can take to get legacy code to run in parallel,"
he said. "When it comes to manycore, 128 cores and all that, I think it will
have to be written from the ground up. I think a lot of fine-grained problems
aren't going to scale, just like [Madden] said."
Still, some clever things
can be done today. One is to create a "parallel reducer server" for parasitic
reduction. The idea is to take a net and reduce it to a math transfer function,
using parallel calls to the server. There are no threads or thread safety
issues, and the approach provides data locality, which will be important for
Another example is parallel
noise analysis. It is hard to make legacy code thread-safe, but you can use a
Copy on Write fork() construct as a short-term solution. It shares only "read"
memory with its parent, but if memory is modified, it is automatically
duplicated in the child.
easy-to-parallelize applications are more scalable, Syprou said that the
overall Encounter netlist-to-GDSII flow is now running about 2X faster on 4
cores, thanks to the work that's been done to parallelize legacy code.
I think there's some truth
in each of these presentations. Legacy serial code is really hard to
parallelize, and not everything should be parallelized. University research may
well focus too much on parallel programming and not enough on serial
bottlenecks. But let's not throw up our hands and say it can't be done. Let's
get to work and do what we can to speed up EDA software for multicore platforms,
just as Spyrou has been doing since 2006 at Cadence.
It is astonishing to read about what Patrick Madden had
to say, and of all places, at a conference attended by the
big/well connected group of people.
I find working on commercial parallel (especially fine-grain)
capabilities to be truly liberating and satisfying mainly
because there is a heavy premium on technical/intellectual
honesty. This is because, unlike in the serial world,
any parallel capability is its own competitor.
All the End-User needs to do is run experiments with
1,10,20,40,80,etc cores while leaving all other parameters
exactly the same, compute the speedups, efficiencies,
and iso-efficiencies, and judge for themselves what the
And, if the trend-lines are not favorable, there is nothing
the software vendors can do but acknowledge the problems.
Locating the processors close to a shared memory is what the Intel memory architecture does. Then you get bus contention. So you put in caches so you don't have to go to the bus so often. Then you put in stuff to keep the caches coherent. In the end you have a huge monster.
Or you can go the AMD way, and give each processor its own memory, and let the processors communicate to get to memory attached to other processors. I don't know how (or if) he operating system figures out which pages one process works on most and assigns/moves them to the right processor. In any case, you end up sending vast quantities of messages between processors to handle the shared memory, and you end up with a huge monster.
An advantage of a Message Passing style of programming is that the expensive part of the program, the communication, is made explicit. The disadvantage is that the communication is made explicit...
Agreed! Perhaps one of those "novel architectures" is 3D ICs, with stacked die including logic and memory. This was also a hot topic at the EDP workshop.
One interesting thought that was presented was that the real bottleneck is actually moving the data around. That is, if you could locate many processors close to a shared memory, that would help increase performance. Today, a lot of time is spent moving data over a network. Perhaps there are some novel architectures that will arise to help make this more efficient in the future.