Paul McLellan
Tags: featured, neural networks, AI, neural nets

A History of Neural Networks

19 May 2020 • 8 minute read


Research on biological neurons started back in the 1940s, before computers, and long before integrated circuits. Some research started at IBM in the 1950s to model neurons and, in 1956, the Dartmouth Summer Research Project on Artificial Intelligence kicked off more work. John von Neumann suggested that it might be possible to imitate simple neuron functions with relays or vacuum tubes. Frank Rosenblatt at Cornell began work on the Perceptron in 1958. It was originally just software, but it was later built in hardware and is the oldest neural network still in use today:

This machine was designed for image recognition: it had an array of 400 photocells, randomly connected to the "neurons". Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.

In 1969, Marvin Minsky and Seymour Papert produced a famous book, Perceptrons, that pointed out some of the limitations of single-layer neural networks. As a result of this book, interest in neural networks declined for two decades, reaching a low in about 1990. This period is sometimes called the "AI Winter". Funding was cut for AI, the AI-based Japanese 5th Generation computer project ended, all the companies selling LISP machines failed, and so on.

There was a general feeling in that era that AI and neural networks were interesting, but could only work on toy problems. They would never be useful on any real-world problem that couldn't already be solved easily some other way. In fact, in both the UK and the US, government policy changes (the Lighthill Report in the UK and the Mansfield Amendment in the US) led to the withdrawal of most funding.

ImageNet and ILSVRC

It turned out that neural networks required much more data than anyone expected to train them. I think that this was partly because people didn't have a lot of data, so they had no way to find out how much was really needed. Also, the amount of computer power available a decade ago was limited. Amazon Web Services (AWS) was the first cloud supplier. They launched in 2006 but rolled out slowly. For example, they didn't launch in Europe until 2009.

But something else critical happened in 2009. Amazingly, it was a poster session at a conference, which usually means that the paper wasn't considered of a high enough standard to be accepted for presentation at the conference itself. The conference was CVPR, the Conference on Computer Vision and Pattern Recognition. The poster was presented by a group from the Princeton Computer Science department. Its title was ImageNet: A Large-Scale Hierarchical Image Database. As I put it in my post when I wrote about it, it was ImageNet: The Benchmark that Changed Everything. To show just how much everyone believes this, at the recent Linley Processor Conference, one of the presenters talking about speech recognition said "we need an ImageNet moment".

Despite its name, ImageNet is not a neural network. It is a collection of annotated images. The images themselves do not form part of ImageNet; they are just photos that people have put on the net. But in addition to the images, there are annotations of what is in each picture ("there is a cat in this image"). The annotations were crowd-sourced. Today there are over 15 million images classified into over 20,000 categories.

Until ImageNet, neural networks couldn't do as well as more algorithmic approaches to image recognition (finding edges, finding eyes, that sort of thing). The type of neural network used for image recognition is called a convolutional neural network or CNN. Each neuron in the first layer of the network takes in a tiny bit of the raw pixels of the image. CNNs rapidly improved to the point that today they are superhuman. If you think that you can do as good a job as any computer at recognizing dogs, see how well you do distinguishing 120 different dog breeds. This was given additional focus the following year by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which researchers competed to achieve the highest recognition accuracy on several tasks.
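To make that "tiny bit of the raw pixels" concrete, here is a minimal sketch, in Python with NumPy, of what a single filter in a first convolutional layer computes: a weighted sum over a small patch of pixels, slid across the image. The image size, kernel values, and activation choice are invented for illustration and are not from the post or any particular framework.

```python
import numpy as np

# A toy 28x28 grayscale image and one 3x3 "learned" filter.
# Both are random here, purely for illustration.
image = np.random.rand(28, 28)
kernel = np.random.randn(3, 3)
bias = 0.0

def conv2d_single_filter(image, kernel, bias):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]           # the tiny window of raw pixels
            out[i, j] = np.sum(patch * kernel) + bias   # one neuron: a weighted sum
    return np.maximum(out, 0)                           # ReLU activation

feature_map = conv2d_single_filter(image, kernel, bias)
print(feature_map.shape)  # (26, 26): one feature map produced by one filter
```

A real CNN applies many such filters per layer and stacks many layers, but each output value is still just a weighted sum over a small neighborhood of the previous layer.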

As I said in my ImageNet blog post linked above:

In 2010, image recognition was algorithmically based, looking for features like eyes or whiskers. They were not very good, and a 25% or larger error rate was normal. Then suddenly the winning teams were all using convolutional neural networks, and the error rates started to drop dramatically. Everyone switched, and the rates fell to a few percent.

Gradient Descent and Backpropagation

Although neural networks worked if you got the weights correct, it was not well understood how to go about training them. The breakthrough was a technique called backpropagation. Say you have a simple neural network for recognizing the digits 0 to 9. You set all the weights to random values to get going. You show it a 7, and it says the most likely number is 9. The challenge is to tweak the weights so that it is more likely to say 7 next time. The training process has to go back through the CNN and make small adjustments to the weights, a process known as gradient descent. It is actually a sort of hill-climbing algorithm (descending the error surface rather than climbing it), like many EDA algorithms or the famous Simplex linear programming algorithm. You look for which input to a neuron is contributing most to the error and adjust the corresponding weight to reduce it. Then go to the previous layer.
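Here is a minimal sketch, in Python with NumPy, of that idea: a tiny two-layer network repeatedly shown a single "image" labeled 7, with the error pushed backwards through the layers and every weight nudged a little downhill. The network shape, data, and learning rate are invented for illustration; real training uses millions of labeled examples and a framework, not a single sample and a hand-written loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# One flattened 28x28 "image" and its label (the digit 7), both synthetic.
x = rng.random(784)
label = 7

# Random initial weights for a 784 -> 32 -> 10 network.
W1, b1 = rng.normal(0, 0.01, (32, 784)), np.zeros(32)
W2, b2 = rng.normal(0, 0.01, (10, 32)), np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.1
for step in range(100):
    # Forward pass: the network's current guess for each digit 0-9.
    h = np.maximum(W1 @ x + b1, 0)      # hidden layer with ReLU
    probs = softmax(W2 @ h + b2)        # output probabilities

    # Cross-entropy loss against the true label (how wrong the guess is).
    loss = -np.log(probs[label])

    # Backward pass (backpropagation): chain rule, layer by layer,
    # computing how much each weight contributed to the error.
    dlogits = probs.copy()
    dlogits[label] -= 1.0               # gradient at the output layer
    dW2 = np.outer(dlogits, h)
    db2 = dlogits
    dh = W2.T @ dlogits
    dh[h <= 0] = 0.0                    # ReLU gradient
    dW1 = np.outer(dh, x)
    db1 = dh

    # Gradient descent: nudge every weight a little in the downhill direction.
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1

print("most likely digit after training:", probs.argmax())  # should be 7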

Again, the driver was the availability of huge amounts of data and huge amounts of computer power to do the processing. 

A good review paper from that period is in Nature, by Geoff Hinton, Yann LeCun, and Yoshua Bengio, simply titled Deep Learning. They point out just how far out in the wilderness the work they were doing on backpropagation was considered to be:

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

That all changed in 2012, less than a decade ago. Another quote from the Nature paper:

Despite these successes, [CNNs] were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches.

Only a few years later, at the 2014 Embedded Vision Conference, Yann LeCun gave one of the keynotes. He had a little video camera attached to his laptop, which was running a neural network trained on ImageNet. He pointed it at various things — his shoe, his mouse, a cup of coffee — and his computer said what it was. I've seen lots of image recognition demos since then, but it was an eye-opener for me. It wasn't even a great setup: running on a laptop with a cheap camera in poor lighting.

Geoff Hinton, Yann LeCun, and Yoshua Bengio would go on to receive the Turing Award in 2019 (technically the 2018 prize) for their work in this area. I wrote that up when the award was announced in my post Geoff Hinton, Yann LeCun, and Yoshua Bengio Win 2019 Turing Award.

Unsupervised Learning

 So far all of this is what is called supervised learning. The brass ring of neural networks is unsupervised learning.

I think that is best exemplified by a quote from another Embedded Vision Conference keynote, this time in 2017 by Jitendra Malik of UC Berkeley:

By the age of two or three, kids have become visual learning machines. They can tell the difference between cats and dogs with at most hundreds of examples. But then they have a trick that visual researchers can only dream about. You take them to the zoo and say "that is a zebra." That's all it takes. We still need a few thousand pictures of zebras for training.

What your toddler does is unsupervised learning, sometimes called self-supervised learning, which is more accurate.

In very restricted domains, such as playing Chess and Go, a lot of progress has been made. See my post from a couple of years ago, Deep Blue, AlphaGo, and AlphaZero. It is amazing enough that programs like Deep Blue and AlphaGo became good enough to beat the world champions at Chess and Go. They were programmed by experts, ran on some of the most powerful hardware available, and drew on all the literature on openings and endgames. In effect, they had all the accumulated knowledge about the games from the best humans.

AlphaZero is even more amazing. It was just given the rules of Chess (or Go). It played against itself and had to work all that out from scratch. When you think that it takes a smart human a decade to get really good at chess, it is amazing that AlphaZero can do it faster. And not just a little faster: it took AlphaZero less than a day to be able to beat Stockfish, the world's most powerful chess program (which is already far stronger than the world champion, Magnus Carlsen).

Of course, Chess and Go have comparatively simple rules compared to, say, driving or medical diagnosis. As an article in The Atlantic Monthly asked, "AI Keeps Mastering Games, But Can It Win in the Real World?"

Or in the world of chip design?

A lot of EDA has fairly simple "rules" in the sense that the function of a chip or system is fixed by the RTL, and we are optimizing some combination of power, performance, and area (PPA). That can be a bit subtle since they are all important: how much area would you give up for another 50MHz of operating frequency? There are actually other dimensions, such as reliability, testability, and thermal, so even PPA is an oversimplification. But the same techniques that have revolutionized image and speech recognition, not to mention becoming world-class at Chess from a standing start, can be applied to improve EDA tools, too, as one part of what Cadence calls Intelligent System Design.
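To illustrate why PPA is a trade-off rather than a single number, here is a toy Python sketch of scoring candidate implementations with a weighted cost over power, performance, and area. The candidates, weights, and units are invented purely for illustration and do not come from any real tool or flow.

```python
# Toy PPA trade-off: change the weights and the "best" implementation changes,
# which is exactly the subtlety described above. All numbers are invented.

candidates = [
    # (name, power_mW, frequency_MHz, area_mm2)
    ("baseline",      120.0, 800.0, 1.00),
    ("upsized_paths", 135.0, 850.0, 1.08),  # ~50 MHz faster, ~8% more area
    ("low_power",     100.0, 760.0, 0.97),
]

# Relative importance of each axis (hypothetical).
w_power, w_perf, w_area = 1.0, 1.0, 1.0

def cost(power, freq, area):
    # Lower is better: penalize power and area, reward frequency.
    return w_power * power / 100.0 - w_perf * freq / 800.0 + w_area * area

best = min(candidates, key=lambda c: cost(*c[1:]))
print("preferred implementation under these weights:", best[0])
```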

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.