Paul McLellan
Breakfast Bytes

Machine Learning for Higher Performance Machine Learning

27 Sep 2017 • 6 minute read

The second-day keynote at the recent (well, over a month ago, there's been lots to blog about) HOT CHIPS conference in Cupertino was given by Google's Jeff Dean, who heads up the Google Brain team. He spoke on Recent Advances in Artificial Intelligence and the Implications for Computer System Design, and gave an introduction to deep learning and deep neural networks. One point he made in his introductory remarks was to answer the question "Why now?" It turns out that this type of computing, first tried about 20 years ago, required far more computer power than was available back then. People assumed a couple of turns of Moore's Law would make the difference, but the available compute was short by a factor of perhaps 1,000, which is about ten doublings, a near-perfect fit for those 20 years. It took huge datacenters full of blazingly fast multicore CPUs before there was enough compute power for the networks to get deep enough to be useful.

You can hardly have failed to notice in the last few years that deep neural networks are making significant strides in speech, vision, language, search, healthcare, and more. The final conclusion of Jeff's talk was:

If you're not considering how to use deep neural networks to solve your problems, you almost certainly should be.

2008 NAE Goals

In 2008, the National Academy of Engineering defined 14 grand challenges for the 21st century. It turns out that over a third of them (5 out of 14) involve aspects of deep learning, artificial brains, and the like.

  • Restore and improve urban infrastructure: Self-driving cars, smart cities
  • Advance health informatics: Deep learning is already better than professionals at recognizing certain types of lesions, tumors, etc.
  • Engineer better medicines: See below
  • Reverse-engineer the brain: Well, that is obviously dead center in this area
  • Engineer the tools for scientific discovery: See below

Engineer Better Medicines

Predicting the properties of molecules is part of drug discovery and also a (smaller) part of the tools for scientific discovery. It turns out that there is a tool called DFT, which stands for (no, not discrete Fourier transform) density functional theory. A DFT simulation takes about 1,000 seconds to analyze a molecule and determine various properties, such as whether it is toxic, whether it will bind with a certain protein, and what its quantum properties are.

Deep learning works well when there is extensive training data. The very existence of ImageNet is one of the things that has driven improvements in vision recognition, finally providing a training dataset with millions of tagged images. However, when you have an existing tool like DFT, it is easy to generate training data: throw thousands or millions of molecules at the DFT simulator, then give each molecule to the deep neural network along with the simulator's output as the training target. It turns out that the neural network learns well, and instead of taking 1,000 seconds to decide whether a molecule is worth further investigation, it takes about 1/100th of a second, roughly 100,000 times faster.
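Here is a minimal sketch of that surrogate-model idea in Python with Keras. The run_dft_simulator function and the fixed-length molecule featurization are hypothetical stand-ins for whatever slow simulator and molecular encoding you actually have; the point is simply that the slow tool labels the data and a small network learns to approximate it.

```python
import numpy as np
import tensorflow as tf

def run_dft_simulator(features):
    """Hypothetical stand-in for the ~1,000-second DFT run.
    Returns the property we want to predict (e.g., binding affinity)."""
    return np.tanh(features @ np.linspace(-1, 1, features.shape[-1]))

# 1. Use the slow-but-accurate simulator to label a training set.
num_molecules, num_features = 10_000, 64            # assumed encoding size
X = np.random.rand(num_molecules, num_features)      # featurized molecules
y = run_dft_simulator(X).reshape(-1, 1)              # "ground truth" labels

# 2. Train a small network to imitate the simulator.
surrogate = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
surrogate.compile(optimizer="adam", loss="mse")
surrogate.fit(X, y, epochs=10, batch_size=128, verbose=0)

# 3. Screening a new molecule is now a single fast forward pass
#    instead of a full simulation run.
candidate = np.random.rand(1, num_features)
predicted_property = surrogate.predict(candidate)
```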

My immediate thought on seeing this was to wonder whether the same approach would work in EDA, circuit simulation being the first thing that sprang to mind. Can you, for example, speed up characterization of certain types of analog blocks by training a neural net to do the same job faster than Spectre can? Maybe that is too ambitious and the problem set needs to be restricted to filters, or certain kinds of radios, but it seems an idea worth trying.

Engineer the Tools for Scientific Discovery

Google created TensorFlow, open, standard software for general machine learning, and deep learning in particular (it was first released in 2015). The goals were to establish a common platform for expressing ideas in machine learning, to open source it so that it is a platform for everyone (not just Google), and to make it the best in the world for both research and production use.
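The "common platform" point is easier to see with an example. Here is a minimal sketch in today's tf.keras (the 2017 talk would have used the original graph-based API), fitting y = 3x - 1 with a single dense layer; the numbers are made up for illustration.

```python
import tensorflow as tf

# A toy "idea expressed in TensorFlow": learn y = 3x - 1 from five examples.
xs = tf.constant([[-1.0], [0.0], [1.0], [2.0], [3.0]])
ys = tf.constant([[-4.0], [-1.0], [2.0], [5.0], [8.0]])

model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)

print(model.predict(tf.constant([[10.0]])))  # approaches 3*10 - 1 = 29
```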

It seems to be working, with over 800 non-Google contributors, 21,000 commits in 21 months, and many community-created tutorials and projects, and it is in growing use in machine learning classes. Judging by the graph Jeff showed, which used GitHub stars as a surrogate for mindshare, it appears to be winning.

You have probably heard about the Google TPU, the Tensor Processing Unit. This is a Google-designed chip for neural net inference, in production use for over 30 months and used for search queries, machine translation, the (in)famous AlphaGo matches in which a computer beat the world's best Go players, and more. The first version, v1, was useful for inference but not for training, so they designed v2, which handles both training and inference. It has two 8GB stacks of high-bandwidth memory (HBM) delivering 600 GB/s of memory bandwidth, and it delivers 45 TFLOPS. It is also designed to be connected into larger configurations.

Connected together, 64 second-generation TPUs form a TPU Pod, delivering 11.5 petaflops on top of terabytes of HBM. And it can be yours (soon). Well, not yours, you can't buy one, but it will be available through Google Cloud in the form of a Cloud TPU: a virtual machine with a 180 TFLOPS TPU attached, programmed via TensorFlow. With only trivial modifications, the same program will run on CPUs, GPUs, and now TPUs.
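As a rough illustration of those "trivial modifications", in today's TensorFlow the TPU-specific part is essentially just the distribution strategy. This is a hedged sketch, assuming you are running somewhere a Cloud TPU is reachable; the model and dataset are placeholders.

```python
import tensorflow as tf

# TPU-specific setup: detect and initialize the TPU system.
# (The TPU name/address passed to the resolver depends on your environment.)
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
# On CPU/GPU you would swap in, e.g., tf.distribute.MirroredStrategy();
# everything below stays the same.

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=5)  # identical call on CPU, GPU, or TPU
```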

Machine Learning for Higher Performance Machine Learning

For large models, model parallelism is important. But getting good performance when you have multiple computing devices (and not all of the same type) is non-trivial and non-obvious. Machine learning is great at attacking non-trivial and non-obvious problems, so why not use it to learn how to improve the partitioning? It turns out this works really well, and can produce better model placements than human experts can.
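Google's actual approach used a sequence-to-sequence policy over the whole TensorFlow graph, with measured step time as the reward. As a much-simplified, hedged sketch of the same reinforcement loop, here is an independent per-op policy in plain Python; measure_step_time is a hypothetical stand-in for actually running the partitioned model.

```python
import numpy as np

OPS = ["embed", "lstm_fwd", "lstm_bwd", "attention", "softmax"]   # toy op list
DEVICES = ["cpu:0", "gpu:0", "gpu:1"]

def measure_step_time(placement):
    """Hypothetical: run one training step with ops assigned per `placement`
    (a dict op -> device) and return the measured wall-clock time."""
    rng = np.random.default_rng(hash(tuple(sorted(placement.items()))) % 2**32)
    return 1.0 + rng.random()  # stand-in for a real measurement

# Policy: an independent softmax over devices for each op.
logits = np.zeros((len(OPS), len(DEVICES)))
lr, baseline = 0.1, None

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    choice = [np.random.choice(len(DEVICES), p=p) for p in probs]
    placement = {op: DEVICES[d] for op, d in zip(OPS, choice)}

    reward = -measure_step_time(placement)              # faster step = higher reward
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward

    # REINFORCE update: push probability toward choices that beat the baseline.
    for i, d in enumerate(choice):
        grad = -probs[i]
        grad[d] += 1.0
        logits[i] += lr * (reward - baseline) * grad

best_placement = {op: DEVICES[int(np.argmax(row))] for op, row in zip(OPS, logits)}
print(best_placement)
```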

Reducing Inference Cost

The first trick is to quantize: most models tolerate very low precision for the weights (8 bits or even less), which, compared with 32-bit floats, gives a 4X memory reduction and a 4X increase in computational efficiency.
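As a minimal sketch of what that means, here is the generic linear (affine) quantization recipe for going from float32 weights to int8 and back; real toolchains differ in details such as symmetric vs. affine mapping and per-tensor vs. per-channel scales.

```python
import numpy as np

def quantize_int8(weights):
    """Linear (affine) quantization of float32 weights to int8.
    Returns the int8 tensor plus the scale/zero-point needed to dequantize."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0                   # map the range onto 256 levels
    zero_point = np.round(-128.0 - w_min / scale)
    q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)      # a toy weight matrix
q, scale, zp = quantize_int8(w)

print(w.nbytes / q.nbytes)                            # 4.0: the 4X memory saving
print(np.abs(w - dequantize(q, scale, zp)).max())     # worst-case rounding error
```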

Something that we at Cadence have done a fair bit of work on is distillation: taking a giant, highly accurate model and producing a smaller, simpler model with almost the same accuracy that can run on a phone or a Tensilica processor. Google has been getting very good results with this, and so have we. See my post CactusNet: One Network to Rule Them All.
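For concreteness, here is a hedged sketch of the standard (Hinton-style) distillation loss, which mixes the usual hard-label loss with a soft-target loss against the teacher's temperature-softened outputs. This is the generic recipe, not Cadence's or Google's exact formulation; the teacher and student models are assumed to exist elsewhere.

```python
import tensorflow as tf

def distillation_loss(labels, teacher_logits, student_logits,
                      temperature=4.0, alpha=0.1):
    """alpha weights the hard-label term; (1 - alpha) weights the soft-target term."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.log_softmax(student_logits / temperature)
    soft = -tf.reduce_sum(soft_teacher * soft_student, axis=-1)
    # The T^2 factor keeps the soft-loss gradients on the same scale as the hard loss.
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft

# Usage inside a custom training step (teacher frozen, student trainable):
# teacher_logits = teacher(images, training=False)
# with tf.GradientTape() as tape:
#     student_logits = student(images, training=True)
#     loss = tf.reduce_mean(distillation_loss(labels, teacher_logits, student_logits))
# grads = tape.gradient(loss, student.trainable_variables)
# optimizer.apply_gradients(zip(grads, student.trainable_variables))
```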

Learning to Learn

The dream is automated machine learning, obviating the need for a huge training dataset. As I said in my post Embedded Vision Summit: It's a Visual World, toddlers have a trick that vision researchers dream about. You take a toddler to the zoo. You point to an animal. You say "that's a zebra." That's all it takes. Vision algorithms need a few thousand pictures of zebras in all configurations to get the same level of recognition. Today's approach relies on machine learning expertise, plus data, plus computation.

The question is whether we can substitute computation, which is cheap, for the machine learning expertise, and move the equation to data plus 100X more computation. There are early encouraging signs using reinforcement learning: generate ten models, train them for a few hours, and then use the loss of the generated models as the reinforcement learning signal.
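That "generate ten models and use their loss as the signal" loop is essentially reinforcement-learning-based neural architecture search. The real controller was an RNN proposing whole architectures; the following is a drastically simplified, hedged sketch over a toy search space, with train_and_evaluate as a hypothetical stand-in for briefly training each candidate.

```python
import numpy as np

# Toy search space: a few discrete architecture choices.
CHOICES = {
    "num_layers":  [2, 4, 6, 8],
    "layer_width": [64, 128, 256, 512],
    "kernel_size": [1, 3, 5, 7],
}

def train_and_evaluate(arch):
    """Hypothetical: build the model described by `arch`, train it briefly,
    and return its validation loss. Faked here with a deterministic random score."""
    rng = np.random.default_rng(hash(tuple(sorted(arch.items()))) % 2**32)
    return rng.random()

# One softmax policy per choice, updated with REINFORCE.
logits = {k: np.zeros(len(v)) for k, v in CHOICES.items()}
lr = 0.2

for generation in range(50):
    samples, losses = [], []
    for _ in range(10):                                   # "generate ten models"
        arch_idx = {}
        for k, v in CHOICES.items():
            p = np.exp(logits[k]); p /= p.sum()
            arch_idx[k] = np.random.choice(len(v), p=p)
        arch = {k: CHOICES[k][i] for k, i in arch_idx.items()}
        samples.append(arch_idx)
        losses.append(train_and_evaluate(arch))           # "train them for a few hours"

    rewards = -np.array(losses)                           # lower loss = higher reward
    baseline = rewards.mean()
    for arch_idx, r in zip(samples, rewards):             # loss as the RL signal
        for k, i in arch_idx.items():
            p = np.exp(logits[k]); p /= p.sum()
            grad = -p
            grad[i] += 1.0
            logits[k] += lr * (r - baseline) * grad

best = {k: CHOICES[k][int(np.argmax(logits[k]))] for k in CHOICES}
print(best)
```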

Another approach is to look at the various hand-designed rules humans use to update the weights during training...and then use machine learning to find a better way to do it.

What Might the Future Look Like?

We would like something more like a brain: a single large model where different parts are used for different tasks, and which can reconfigure itself based on experience. Then, of course, we need a better computational substrate (like the TPU and other such approaches), along with a way to map those powerful but sparse models onto that hardware.

Video of the Keynote

You can watch a video of the entire keynote (about an hour) online.

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.