Paul McLellan


AlphaZero: Four Hours to World Class from a Standing Start

16 Jan 2019 • 7 minute read

Last year I wrote about AlphaZero in my post Deep Blue, AlphaGo, and AlphaZero. Here's a quick recap to jog your memory; read that earlier post for more detail. Deep Blue was the program developed by IBM that defeated Garry Kasparov, at the time the #1 chess player in the world, at chess in 1997. Two years ago, in 2017, AlphaGo beat Ke Jie, the #1 Go player in the world (I'll capitalize Go when it is the game, since it is a common word and sometimes makes sentences confusing). Later that year, AlphaZero defeated Stockfish, the #1 chess-playing program in the world, stronger than any human.

If you have ever played chess at anything more than a casual kid's level, then you know that there has been a lot of research in chess: standard openings analyzed, end-games analyzed, overall strategy for board position, and more. There are millions of recorded games by the best players. All of this information was incorporated into Stockfish, along with a powerful analysis engine. The analysis engine for any game like chess or Go works by considering some subset of possible moves, followed by some subset of possible responses, followed by some subset of possible responses to the response, and so on. If the game is really simple, like Tic Tac Toe to take an extreme example, then all moves and responses can easily be analyzed. Obviously, in more complex games, one important part of the algorithm is to decide which moves and responses to bother to analyze, and which can be ignored since they are "obviously" stupid. The other important part of this algorithm is a way to score how good a board position is, since there isn't enough compute power to analyze all the way to the end of the game. The idea is to find the next move that guarantees the best possible position, even if the opponent responds in the best possible way at each move. Computers are so powerful, especially if you have a lot of them, that they can analyze a lot further into a game than a human player can. That, matched with the playbook from centuries of analysis, makes them unbeatable. For a short time, the best "players" were a human player assisted by a program, but that era is over. Human players can only make the program worse.
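
To make that concrete, here is a minimal Python sketch of minimax search with alpha-beta pruning, the skeleton that classical engines are built on. The state interface (legal_moves, apply, is_terminal, evaluate) is hypothetical; a real engine like Stockfish adds opening books, endgame tablebases, careful move ordering, and a heavily hand-tuned evaluation function on top of it.

```python
def alphabeta(state, depth, alpha, beta, maximizing):
    # Stop at a fixed depth and score the position heuristically, since
    # searching to the end of the game is infeasible for chess or Go.
    if depth == 0 or state.is_terminal():
        return state.evaluate()   # handcrafted scoring of the board
    if maximizing:
        best = float("-inf")
        for move in state.legal_moves():   # engines also skip "stupid" moves here
            best = max(best, alphabeta(state.apply(move), depth - 1,
                                       alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:              # opponent already has a better line: prune
                break
        return best
    else:
        best = float("inf")
        for move in state.legal_moves():
            best = min(best, alphabeta(state.apply(move), depth - 1,
                                       alpha, beta, True))
            beta = min(beta, best)
            if beta <= alpha:
                break
        return best
```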

AlphaZero

AlphaZero didn't work like that. It started from just the rules of chess (and Go, and shogi, a sort of Japanese version of chess). And...that's it. Not only did it not have access to the books of standard openings, it had never seen a single game of chess played. It played itself, and worked out good strategies for beating itself. To be honest, this sounds unpromising. For example, in cryptography, people are always coming up with new schemes that are so powerful that they themselves can't break them. Unfortunately, real experts can. You might expect a self-taught chess program to fail the same way: it works out some way to play chess that wins against a clone of itself as an opponent, but that turns out to be hopelessly weak when pitted against an expert. If you gave an eight-year-old the rules of chess, that is pretty much what you would expect to happen. Even if they had the focus to spend a long time on it, a good player would beat them easily. Yet AlphaZero took just a few hours of training for each game to become the strongest player in the world.

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play

That's the title of a paper in Science by a bunch of authors (David Silver et al.) from the DeepMind project at Alphabet (Google), published on December 7, 2018. There is a longish blog entry on the DeepMind website called AlphaZero: Shedding New Light on the Grand Games of Chess, Shogi, and Go.

The paper itself is on the Science website, and there is an open-access PDF version (although the Science paper doesn't seem to be gated, at least at present). This is the abstract of the paper:

The game of chess is the longest-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. By contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by reinforcement learning from self-play. In this paper, we generalize this approach into a single AlphaZero algorithm that can achieve superhuman performance in many challenging games. Starting from random play and given no domain knowledge except the game rules, AlphaZero convincingly defeated a world champion program in the games of chess and shogi (Japanese chess) as well as Go.

It seems there is a book coming out this month called Game Changer, by Grandmaster Matthew Sadler and Women's International Master Natasha Regan, analyzing thousands of AlphaZero's games. This is probably a lot more than you are interested in, but the authors say that its style is unlike any traditional chess engine:

It is like discovering the secret notebooks of some great player from the past.

Another quote is from Garry Kasparov himself (the world champion whom Deep Blue defeated):

I can't disguise my satisfaction that it plays with a very dynamic style, much like my own!

The image above shows how AlphaZero's skill improved as it went through more training steps. The y-axis shows the Elo rating, the principal rating system for chess players (although it looks like it should stand for something, Elo is actually the name of the inventor of the system). The green horizontal line is Stockfish, the most highly rated chess program (by the way, Magnus Carlsen, the current world champion, has an Elo rating of 2835, far below Stockfish). It took AlphaZero just four hours to go from the rules of chess to surpassing Stockfish (and, eyeballing the graph, about half that time to surpass Carlsen).
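
As an aside, the Elo system has a simple closed form for the expected score between two rated players. Here is a quick Python illustration; Carlsen's 2835 comes from the text above, while the 3400 used for a top engine is just an illustrative assumption, since the post doesn't give Stockfish's exact rating.

```python
# Standard Elo expected-score formula: E = 1 / (1 + 10**((Rb - Ra) / 400)).
def expected_score(r_a, r_b):
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Carlsen's 2835 is from the post; 3400 for a top engine is an assumption.
print(expected_score(2835, 3400))   # ~0.037, about 4 points per 100 games
```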

To learn each game, an untrained neural network plays millions of games against itself via a process of trial and error called reinforcement learning. At first, it plays completely randomly, but over time the system learns from wins, losses, and draws to adjust the parameters of the neural network, making it more likely to choose advantageous moves in the future. The amount of training the network needs depends on the style and complexity of the game, taking approximately 9 hours for chess, 12 hours for shogi, and 13 days for Go.
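
To give a feel for that loop, here is a toy self-play learner in Python. It uses tic-tac-toe and a simple table of move values instead of a deep neural network, so treat it purely as an illustration of the shape of the feedback loop, not as what AlphaZero actually does:

```python
import random
from collections import defaultdict

# Winning lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

value = defaultdict(float)   # learned value of each (board, move) pair

def choose(board, explore=0.1):
    moves = [i for i, c in enumerate(board) if c == "."]
    if random.random() < explore:            # occasionally try something new
        return random.choice(moves)
    return max(moves, key=lambda m: value[(board, m)])

def self_play_game():
    board, player, history = "." * 9, "X", []
    while True:
        move = choose(board)
        history.append((board, move, player))
        board = board[:move] + player + board[move + 1:]
        w = winner(board)
        if w is not None or "." not in board:
            return history, w
        player = "O" if player == "X" else "X"

# Training loop: play a game, then nudge the value of every move made
# toward the final outcome (win +1, loss -1, draw 0) for the player who made it.
for _ in range(20000):
    history, w = self_play_game()
    for board, move, player in history:
        reward = 0.0 if w is None else (1.0 if w == player else -1.0)
        value[(board, move)] += 0.01 * (reward - value[(board, move)])
```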

One interesting fact is that people used to insist that no program would ever be able to beat the best human Go players, since Go has so many choices at each point. Sure enough, the numbers above show that it took 9 hours for chess but 13 days for Go. One wrinkle is that chess and shogi can be drawn, but Go always has a winner, so there is no such thing as "playing for a draw". In fact, last year's world chess championship ended its main (classical) section with 12 draws (rapid chess was used as the tie-breaker, where Carlsen is regarded as unbeatable).

 Since this is running on TPUs in Google datacenters, you might think that some of the strength just comes from being able to analyze more positions. But while a program like Stockfish analyzes tens of millions of positions, AlphaZero only analyzes tens of thousands. Part of its smarts is avoiding going down blind alleys early. The TPUs are not the current generation:

AlphaZero used a single machine with 4 first-generation TPUs and 44 CPU cores. A first-generation TPU is roughly similar in inference speed to commodity hardware such as an NVIDIA Titan V GPU.

Since a modern high-end smartphone contains something like four cores and a neural network processor of some sort, this is not a huge amount of compute power, maybe what you'd get from five to ten smartphones. AlphaZero is brilliant software; it is not just a clever way to bring a lot of compute power to bear on the problem, in the way that Deep Blue was twenty years ago.
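
The clever part is the search itself. Per the paper, AlphaZero uses a Monte Carlo tree search in which the network's policy output steers which moves get explored, via a selection rule usually called PUCT. Here is a minimal Python sketch of that rule; the priors, values, and visit counts are made-up numbers for illustration, and the +1 under the square root is a common implementation tweak to keep exploration nonzero on the first visit, not anything from the paper.

```python
import math

def puct_choice(priors, values, visits, c_puct=1.5):
    """Pick which move to explore next, balancing the value seen so far
    against prior-guided exploration (the PUCT selection rule)."""
    total_visits = sum(visits)
    def score(i):
        q = values[i] / visits[i] if visits[i] else 0.0   # mean result so far
        u = c_puct * priors[i] * math.sqrt(total_visits + 1) / (1 + visits[i])
        return q + u
    return max(range(len(priors)), key=score)

# A move the policy network rates highly (prior 0.7) is explored first;
# a move it dismisses (prior 0.05) may never be visited at all. This is
# why tens of thousands of evaluations can compete with tens of millions.
print(puct_choice(priors=[0.7, 0.25, 0.05],
                  values=[0.0, 0.0, 0.0],
                  visits=[0, 0, 0]))   # -> 0
```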

For more details on the different TPU generations, see my post Inside Google's TPU about Cliff Young's Linley Keynote in November. The picture above and to the right is the v1 TPU that AlphaZero used (the current generation is TPU v3). Since the v1 TPU can deliver 92 teraflops and a podful of v3 TPUs can deliver >100 petaflops, just going by the raw compute power (which you can't, but it is probably order-of-magnitude correct), AlphaZero running on the latest hardware would become world class at chess in under three minutes!
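
Here is that back-of-envelope arithmetic spelled out, using only the numbers above and the (admittedly unrealistic) assumption of perfect scaling with raw compute:

```python
# Numbers from the post: 4 first-generation TPUs at ~92 teraflops each
# for the original 4-hour chess run; a v3 pod at >100 petaflops.
v1_training_compute = 4 * 92e12          # flops available for the 4-hour run
v3_pod_compute = 100e15
speedup = v3_pod_compute / v1_training_compute   # ~270x
print(4 * 60 / speedup, "minutes")       # ~0.9, comfortably under three
```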

If you are at all interested in neural networks and reinforcement learning, the paper is very readable. The paper itself, excluding references and supplementary material, is just five pages long.

 

Sign up for Sunday Brunch, the weekly Breakfast Bytes email.