Wave Computing: a Dataflow Processor for Deep Learning

2 Nov 2016 • 4 minute read

On the first day of the Linley Processor Conference someone told me that he was really looking forward to the presentation the following day from Wave Computing. I had to confess I'd never heard of them. That wasn't entirely surprising. Chris Nicol, their CTO, said it was the first time that they had described their DPU (dataflow processing unit) in public. The chip was meant to have taped out before the conference, but design closure has taken longer than they expected (I'm shocked, that's never happened before!). It should tape out this month, October.

Deep learning networks are dataflow graphs, programmed using deep learning frameworks such as TensorFlow or Caffe. But Wave Computing believe that normal processors, or even GPUs, are not ideal for deep learning dataflow computations. A dataflow processor is a better fit, and that's what they have built.
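To make "dataflow graph" concrete, here is a minimal sketch in Python of data-driven execution: a node fires as soon as all of its operands have arrived, with no program counter deciding the order. The function name, the operator table, and the tiny example graph are all hypothetical illustrations. This is not Wave's programming model or any framework's API.

```python
from collections import deque

def run_dataflow(nodes, edges, inputs):
    """Fire each node once all of its operands are available.

    nodes:  {name: function}        (hypothetical operator table)
    edges:  {name: [operand names]} (who feeds each node)
    inputs: {name: value}           (values for the source nodes)
    """
    values = dict(inputs)
    pending = deque(n for n in nodes if n not in values)
    while pending:
        progressed = False
        for _ in range(len(pending)):
            node = pending.popleft()
            operands = edges[node]
            if all(src in values for src in operands):
                # All operands have arrived, so the node can fire.
                values[node] = nodes[node](*(values[s] for s in operands))
                progressed = True
            else:
                # Still waiting for data; try again on the next pass.
                pending.append(node)
        if not progressed:
            raise ValueError("graph has a cycle or a missing input")
    return values

# Hypothetical example graph: relu(w * x + b)
nodes = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "relu": lambda a: max(a, 0.0),
}
edges = {"mul": ["w", "x"], "add": ["mul", "b"], "relu": ["add"]}
result = run_dataflow(nodes, edges, {"w": 2.0, "x": 3.0, "b": -10.0})
```

The order in which the nodes are listed doesn't matter. Only the arrival of data determines when each one runs, which is the property a dataflow processor exploits in hardware.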

The heart of the system is a custom chip, the dataflow processing unit, or DPU. When I say custom chip, I mean not just that they built their own chip but that it was done using a custom design methodology, not a standard-cell methodology.

The chip is built up hierarchically. The leaf level of the hierarchy is a dataflow processing element (PE). These are built in groups of four that are fully connected, the output from one PE being available to any of the other three PEs on the next cycle (I was going to say the next clock cycle, but the design is unclocked and data-driven).

[Figure: a single dataflow processing element]

The next level up is a cluster of 16 dataflow PEs, along with eight DPU arithmetic units. Because the design is self-timed, it scales to low voltages. In fact, not only is there no clock, there are no global signals at all. Not even a reset.

The next level up again is 8-64 clusters, giving 128-1024 PEs.

Power management is data-driven. When a PE is "asleep", it wakes up if it is sent data. It sleeps if it executes a "sleep" instruction, and can then opt for either fast wakeup or slow wakeup, a sort of deep sleep with even lower power.
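The sleep and wake behavior described above amounts to a small state machine. Here is a sketch of it in Python, assuming three states and made-up wake latencies. The class, the method names, and the latency numbers are placeholders for illustration, not Wave's figures or instruction set.

```python
from enum import Enum

class State(Enum):
    AWAKE = "awake"
    FAST_WAKE_SLEEP = "sleep, fast wakeup"
    SLOW_WAKE_SLEEP = "deep sleep, slow wakeup"

# Placeholder latencies in arbitrary units; the talk gave no numbers.
WAKE_LATENCY = {
    State.AWAKE: 0,
    State.FAST_WAKE_SLEEP: 1,
    State.SLOW_WAKE_SLEEP: 10,
}

class ProcessingElement:
    def __init__(self):
        self.state = State.AWAKE
        self.inbox = []

    def sleep(self, deep=False):
        """Model the "sleep" instruction; deep=True picks slow wakeup."""
        self.state = State.SLOW_WAKE_SLEEP if deep else State.FAST_WAKE_SLEEP

    def receive(self, data):
        """Arriving data wakes the PE; returns the (assumed) wake latency."""
        latency = WAKE_LATENCY[self.state]
        self.state = State.AWAKE
        self.inbox.append(data)
        return latency

pe = ProcessingElement()
pe.sleep(deep=True)          # the PE chooses deep sleep itself
latency = pe.receive(42)     # data arrives and wakes it
```

Note that nothing outside the PE tells it to wake up. The only trigger is data arriving, which fits a design with no global signals.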

Up another level and we are at the board. A board contains four DPUs, with 64,000 processors, 64MB SRAM, 8GB HMC DRAM, and 256GB DDR4.

Up another level and we are at the box. Four boards go in a box, giving 256,000 processors, 256MB distributed SRAM, 32GB HMC DRAM, and 1TB DDR4.

This is designed to be plug-and-play with no change required to the existing datacenter. TensorFlow (or whatever deep learning infrastructure you are using) just runs faster. A lot faster.

Inception V3 is a model implemented for TensorFlow, also called DeepDream (I guess you have to have seen the movie). The training time on 1.28 million images on Wave's system is 15 hours. This compares to thousands of hours on CPUs and hundreds of hours on GPUs.

Here are the statistics for a single chip. It is 400mm² (so 20mm on a side). It is all full-custom, which is extremely tricky. As Chris put it, "16FF rules were never intended to be used by humans."

- 16FF CMOS process node
- 16,000 processors, 8192 ALUs
- Self-timed MPP synchronization
- 181 peak teraops
- 16MB distributed memory
- 8MB distributed instruction memory
- 1.71 TB/s I/O bandwidth
- 270 GB/s peak memory bandwidth
- 2048 outstanding memory requests
- 4 billion 16B random access transfers
- 4 hybrid memory cube interfaces
- 2 DDR4 interfaces
- PCIe 3 16-lane host interface
- Andes N9 MCU
- 1MB program store for paging
- Hardware engine for fast loading of AES-encrypted program
- 32 dynamic reconfiguration zones
- Variable fabric dimensions, user-programmable at boot
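As a sanity check, the chip-level numbers line up with the cluster description earlier. The sketch below assumes that the "8192 ALUs" in the statistics are the same thing as the eight arithmetic units per cluster. That is my reading, not something Chris stated.

```python
ALUS_PER_CHIP = 8192          # from the chip statistics
ARITH_UNITS_PER_CLUSTER = 8   # from the cluster description
PES_PER_CLUSTER = 16          # from the cluster description

# Assumption: every ALU on the chip belongs to exactly one cluster.
clusters_per_chip = ALUS_PER_CHIP // ARITH_UNITS_PER_CLUSTER
pes_per_chip = clusters_per_chip * PES_PER_CLUSTER

print(clusters_per_chip, pes_per_chip)
```

Under that assumption there are 1024 clusters and 16,384 PEs per chip, which is presumably what the rounded "16,000 processors" figure refers to.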

When the chips are all combined into boards and then a box, you end up with 2.9 petaops. That's a lot. Written out in full, that's 2,900,000,000,000,000 op/s.
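The box-level totals follow from multiplying the single-chip statistics by four chips per board and four boards per box. Here is that arithmetic as a short sketch, assuming the peak figures simply add up across chips. The dictionary keys and the helper function are names I made up.

```python
# Single-chip figures from the statistics above.
CHIP = {"processors": 16_000, "sram_mb": 16, "peak_teraops": 181}
CHIPS_PER_BOARD = 4
BOARDS_PER_BOX = 4

def scale(spec, factor):
    """Multiply every figure by the number of units at the next level up."""
    return {name: value * factor for name, value in spec.items()}

board = scale(CHIP, CHIPS_PER_BOARD)
box = scale(board, BOARDS_PER_BOX)
box_petaops = box["peak_teraops"] / 1000

print(board)
print(box, box_petaops)
```

Sixteen chips at 181 peak teraops each come to 2,896 teraops, which rounds to the 2.9 petaops quoted, and the processor and SRAM totals match the board and box figures given earlier.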

During the Q&A, Chris was asked how this compared to Google's TPU (Tensor Processing Unit). He said that Google hasn't published a lot of details, but that the TPU is an ASIC that does one job and does it well. It is a PCI accelerator connected to the host, and the rest of the code is not accelerated. In any case, it is designed more for inferencing than for training. But deep learning algorithms are changing rapidly and so only a programmable solution can survive obsolescence.

Chris's final remark: Wave's DPU "accelerates deep learning from weeks to hours, from hour to seconds, enabling artificial intelligent systems that operate in real time." But he assured us all, "We are not Skynet."
