The keynote at the Embedded Neural Network Symposium, held recently at Cadence, was given by Kunle Olukotun, a professor at Stanford sponsored by Cadence. His keynote was titled Scaling Machine Learning Performance with Moore's Law. He started by pointing out that this is the golden era of data, but only for the best-funded teams staffed with a good leavening of PhD-level engineers.
He then talked about the DAWN project, which is meant to enable anyone with domain expertise in some area to build their own production-quality machine learning products. No PhD in machine learning required, no expertise in databases, no deep understanding of the latest hardware. The goal is to speed up machine learning by 100X and improve performance per watt by 1000X. The approach is a full stack: algorithms, languages and compilers, and hardware.
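There is an inherent tradeoff in iterative machine learning between statistical efficiency (how many iterations it takes to reach a good result) and hardware efficiency (how fast each iteration runs).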
The first thing that you learn in a parallel computing course is that you must lock data before you access it or change it. But Hogwild! (yes, the exclamation point is part of the name) is an approach that runs multiple worker threads without locks. The threads work together on a single copy of the data with races all over the place, but the hardware is used much more efficiently than with the conventional locking approach, where many of the threads would typically be blocked waiting for a lock to be released.
The races can be modeled as noise. If the noise is below the existing noise floor then there is negligible effect on statistical efficiency. You get near linear speedup in terms of the number of threads. This approach has been used in many real systems in industry.
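To make the idea concrete, here is a minimal sketch of lock-free SGD in the Hogwild! style, written in Python with NumPy. It is an illustration under stated assumptions, not the code from the talk: the function name hogwild_sgd, the least-squares problem, and all parameter values are hypothetical. Note also that CPython's global interpreter lock limits true parallelism, so this shows the lock-free structure rather than the near-linear speedup.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, epochs=20, lr=0.01, seed=0):
    """Least-squares SGD where every thread updates one shared w without locks.

    Hypothetical illustration of the Hogwild! idea, not the original implementation.
    """
    n, d = X.shape
    w = np.zeros(d)  # the single shared copy of the model; nothing protects it

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs * n // n_threads):
            i = rng.integers(n)
            # Read w as it is right now; another thread may be mid-update.
            err = X[i] @ w - y[i]
            # Write back in place with no lock; any race acts as extra noise.
            np.subtract(w, lr * err * X[i], out=w)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

# Synthetic problem (assumed for the demo): recover true_w from noisy samples.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.01 * data_rng.normal(size=200)

w_est = hogwild_sgd(X, y)
print(np.round(w_est, 2))
```

The exact numbers vary from run to run because of thread scheduling, which is the point: the result still lands close to true_w because the races are small compared to the noise already present in stochastic gradient descent.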
The next approach is Buckwild! (also exclamation point included), which uses 8- or 16-bit fixed-point numbers instead of 32-bit floating point. Obviously the fewer bits, the better the hardware efficiency, and if stochastic rounding is used then statistical efficiency is preserved too. Again the errors, in this case quantization errors, can be modeled as noise. And again, as long as the noise stays below the existing noise floor there is minimal effect on statistical efficiency. In fact, it turns out that we can go below 8 bits and still not sacrifice accuracy.
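Stochastic rounding is what keeps the quantization error looking like zero-mean noise. The sketch below is a hedged illustration, not the Buckwild! implementation: the function name stochastic_round_fixed and the choice of 4 fractional bits are assumptions made for the example. A value is rounded up with probability equal to its fractional part, so the rounded value equals the original on average.

```python
import numpy as np

def stochastic_round_fixed(x, frac_bits=4, total_bits=8, rng=None):
    """Quantize x to signed fixed point with stochastic rounding (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x, dtype=np.float64) * (2 ** frac_bits)
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part, so E[q] == scaled.
    q = floor + (rng.random(scaled.shape) < (scaled - floor))
    # Saturate to the representable range of a signed total_bits integer.
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    dtype = np.int8 if total_bits <= 8 else np.int16
    return np.clip(q, lo, hi).astype(dtype)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)                 # 0.3 is not representable with 4 fractional bits
q = stochastic_round_fixed(x, frac_bits=4, rng=rng)

mean_stochastic = q.mean() / 2 ** 4       # average of the quantized values, back in real units
nearest = np.round(0.3 * 2 ** 4) / 2 ** 4 # round-to-nearest always gives the same biased value

print(mean_stochastic, nearest)
```

Round-to-nearest maps every 0.3 to 0.3125, a systematic bias. With stochastic rounding each individual value is 4/16 or 5/16, but the average stays close to 0.3, so the error behaves like noise rather than drift.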
So the message is to relax: relax synchronization, relax precision, relax coherence. It's all only noise and the data is noisy already.
There are lots of ways to build specialized machine learning accelerators: dozens of cores in a multi-core SoC, GPUs, FPGAs, or even entire data centers. The problem is that these approaches all require experts in parallel programming and there just are not enough of them. What if we could write one machine learning program and then automatically run it on any of these architectures? This is where OptiML (no exclamation point!) comes in. It provides a familiar MATLAB-like language for writing machine learning applications. The data structures (vectors, matrices, etc.) are inherently parallel and there are implicitly parallel control structures (such as sum or loops).
The system that Kunle and his team are working on to use this language is called Delite. It has three main phases: one that takes the input language and does domain-specific transformations, a second that does generic transformations, and a third that consists of code generators for the specialized hardware, which output Scala, C++, CUDA, OpenCL, MPI or Verilog. Lots of parallel patterns are supported, such as map-reduce (which underlies both Google's algorithms and the open-source Hadoop), join, filter and more.
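The value of expressing a program as parallel patterns is that the same description can be executed in different ways. The following Python sketch is only an analogy, not OptiML or Delite code: the names run_sequential and run_chunked are hypothetical, and a thread pool stands in for a real hardware backend. One filter/map/reduce description is run by two interchangeable "backends" that must agree on the answer.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def run_sequential(data, keep, f, combine, init):
    """Reference backend: filter, then map, then reduce, one element at a time."""
    return reduce(combine, map(f, filter(keep, data)), init)

def run_chunked(data, keep, f, combine, init, workers=4):
    """Alternative backend: split into contiguous chunks, reduce each, then merge.

    Assumes combine is associative and init is its identity element.
    """
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda c: run_sequential(c, keep, f, combine, init), chunks
        )
        return reduce(combine, partials, init)

# The "program": sum of the squares of the even numbers. It says nothing
# about how the work is scheduled.
data = list(range(1000))
program = dict(
    keep=lambda v: v % 2 == 0,
    f=lambda v: v * v,
    combine=operator.add,
    init=0,
)

seq_result = run_sequential(data, **program)
par_result = run_chunked(data, **program)
print(seq_result == par_result)
```

Because the program only states what to filter, map and combine, the backend is free to choose the chunking and the degree of parallelism, which is the property that lets a compiler retarget the same source to CPUs, GPUs or FPGAs.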
Since the machine learning is expressed in a hardware independent manner, it automatically scales with Moore's Law. More cores, faster cores, bigger clusters, big FPGAs: these all provide a better computing fabric that Delite can automatically take advantage of.
The more specialized the hardware, the more energy efficient it is. We are not talking 20% here, we are talking factors of as much as 1000X. See the graph below. On the left are general-purpose CPUs, which are as programmable as can be. Then CPU+GPU, then FPGA and finally specialized custom hardware (which is not programmable: to change it you need to redesign the chip).
So the best of all possible worlds would be to find a sweet spot that has the programmability of CPUs and the energy efficiency of custom SoCs. This is software defined hardware (SDH). The goal is 100X performance per watt compared to a CPU, 10X performance per watt compared to GPUs or FPGAs, and 1000X the programmability of FPGAs (no Verilog required). The SDH in Kunle's group is called Plasticine. It consists of a network of pattern compute units, memory compute units and switches.
Plasticine sits on the earlier graph way over to the top left, which is that sweet spot. Very high power efficiency, and very programmable.
We really can have it all: power, performance, programmability, portability. At the top level, there are better approaches to algorithms with Hogwild! and Buckwild!. In the middle, there are domain-specific languages such as OptiML, and the high-level compilation of Delite that maps everything into specialized highly parallel hardware.
It turns out I have a connection to Plasticine. It was invented by William Harbutt (in 1897, according to Wikipedia). In the UK, when I was growing up, it was manufactured by a company also called Harbutt's in Bathampton, just outside Bath. The managing director (CEO) was the inventor's son. And his son was my best friend in primary school, Richard Harbutt. I was amazed when I first was invited to a play date at his house. It was the first time I'd ever been in a huge house that these days we'd call a McMansion. There was obviously a lot of money in Plasticine. Apparently, production remained in Bathampton as late as 1983 when it was moved to Thailand. The company went through ownership changes and today Wikipedia says it is owned by a company that I've never heard of, called Flair.