The Wednesday keynote speech at the Design Automation Conference offered a strong argument for general-purpose graphics processing units (GPGPUs) as the best way to accelerate EDA and other compute-intensive applications. But whether GPGPUs will prove to be a better solution than more conventional multicore architectures is a difficult question to answer.
The speaker was William Dally, chief scientist at nVidia and professor of engineering at Stanford University. He’s a compelling speaker – this is one keynote that held the audience’s attention. His keynote was titled “The End of Denial Architecture and the Rise of Throughput Computing,” and that’s a pretty good summary of the content of the speech.
Few would disagree with Dally’s opening statement. Citing decreasing performance improvements in single-threaded processors, he noted that “we can’t afford to be in denial about the shift to parallelism.” All performance gains in the future will come from parallelism, he stated. And he argued that efficiency will come from “locality.” Moving a word across a die is very expensive in terms of energy, so it’s best to keep things local, he said.
Dally said that single-threaded processors are in “denial” about parallelism and locality. They provide two illusions. First, they try to exploit instruction-level parallelism (ILP), which has limited scalability. Second, flat memory denies locality and provides the “illusion of caching,” which turns out to be inefficient if the working set of data doesn’t fit into the cache.
Dally then went on to argue that “latency-optimized processors” are improving in performance very slowly, while “throughput-optimized processors” are improving at 70 percent per year. He cited the nVidia GeForce, which has 240 scalar processors, as an example of a “throughput” processor. At this point, however, I could have used a clearer definition of these terms and some more examples of what he would include in either category.
Dally turned the discussion to EDA, and here is where it gets more controversial. He cited the obvious need for parallel computing platforms for EDA applications, but said that multicore chips made up of 4 or 8 “latency-optimized processors” are a “slippery slope” that cannot come close to delivering the performance-per-watt of a “throughput-optimized” architecture with hundreds of processors. “Going parallel on a latency-optimized processor is beside the point. You’re not getting the gains of parallelism if you do that,” he said.
Why, then, are there so few EDA applications on GPGPUs today, and why are most EDA developers targeting what Dally would probably call “latency-optimized” multicore architectures? To get another perspective I asked Tom Spyrou, distinguished engineer at Cadence, whom I interviewed previously about his work in porting the Encounter Digital Implementation System to multicore platforms.
Tom noted that GPGPUs work better on a subset of data-parallel problems, but most EDA applications involve significant data manipulation that doesn’t parallelize, which caps the maximum speedup no matter how many CPUs are involved. 16-core CPUs are coming down in price and will soon be as cheap as today’s dual-core CPUs. GPUs, meanwhile, require a code rewrite and some specialized coding skills.
On the other hand, as U.C. Berkeley professor Kurt Keutzer noted in a recent interview, if you want “manycore” (more than 32 CPUs) parallelism, GPGPUs are the commercial platforms available today. And programming environments are available. While the CUDA programming environment was developed by nVidia, OpenCL aims to create an open standard for programming GPGPUs.
My take is that EDA developers will write applications for whatever compute platforms engineers use. Whether future platforms will be based on “latency-optimized” multicore devices or “throughput-optimized” GPGPUs remains to be seen. It’s not just a question of performance-per-watt, or whether you have 32 processors or 500. The ultimate question is going to be the effective speedup-per-dollar.
But no matter which direction future compute platforms take, Dally is absolutely right about one thing – it’s time to give up “denial” and move forward into an era of parallelism.
Nowadays we have more processor cores per IC from AMD (well, they have a maximum of 8 modules on a single chip, each with two cores sharing specific resources). Does that mean we can now run faster on AMD than on Intel?
When will nVidia and/or AMD design their own ICs on a GP-GPU EDA platform?
Thanks, Richard, and hello again! Without the benefit of a recording or presentation from Professor Dally's talk, I can only go by impressions from various reporters like yourself. The term 'general-purpose graphics processing unit' seems almost at odds with itself (like George Carlin's joke, 'military ... intelligence?' ... with all due respect to those who have served, of course). From your suggestion, maybe someone needs to come up with a processor containing a VLIW (or SIMD) machine with EDA-friendly custom processors. That would limit its utility outside of those applications but might bring a revival of the term 'engineering workstation'! (:
But seriously, I think that any processor customization restricts it to a limited class of problems with the same underlying algorithms. A processor targeted for fluid dynamics, for example, could be used by those who study traffic (e.g., Northwestern University's Traffic Institute), where fluid dynamics concepts have been applied to model rush hours!
I wonder if such a debate is raging on some computer architecture blogs out there? Maybe some computer architecture experts or dilettantes out in the EDA audience might want to weigh in ...
Thanks to you and to Cadence for this site! (:
Excellent points, Gary. The question now is which EDA applications will fit within that "subset" of problems that GPUs are ideally suited for, and how much speedup they'll provide at what cost. In many cases, general-purpose 16- and 32-processor multicore ICs may be good enough.
Hi again, Richard ... from the abstract of William Dally's talk, published in advance of DAC, I had the impression that he would advocate parallelism not in the form of MIMD (i.e., multicore, with general-purpose processors) but with coprocessors (GPUs, in this case) on a single computing element ("chip"). Thanks to you and other reporters/bloggers, I have more useful details even though I am in ... Winnipeg, Manitoba, Canada! (:
My impression now is that Professor Dally is advocating an approach beyond VLIW: implementing a processor that is, itself, a SIMD machine with a number of GPUs. That would be fine, but as your colleague Tom Spyrou points out, GPUs are custom processors optimized for a certain set of algorithms to efficiently solve a certain class of problems, as is the case with DSPs. A super-duper processor implemented as a SIMD machine with GPUs and/or DSPs would probably not fare much better than a general-purpose chip (e.g., Intel, PowerPC, ARM, etc.) if faced with a different problem it is not targeted for, say ... search!