Google/DeepMind/Silver and Huang’s new Go-playing program

There’s some buzz around recent improvements in Go-playing programs. Google made a program that is pretty good against human opponents, but it uses 170 GPUs and 1200 CPUs!

paper: Mastering the Game of Go with Deep Neural Networks and Tree Search by David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis

summary: they create a convolutional neural network with 13 layers to select moves (given a game position, it outputs a probability distribution over all legal moves, trying to assign higher probabilities to better moves). They train the network on databases of expert matches, save a copy of the trained network as ‘SL’, then train it further by playing it against randomly selected previous iterations of itself. Then they use the history of the move-selecting network playing against itself to generate a new training set consisting of 30 million game positions, each paired with the outcome of the game it came from, with each of the 30 million positions drawn from a separate game. They use this training set to train a new convolutional neural network (with 13 layers again, i think) to appraise the value of a board position (given a board position, it outputs a single scalar that attempts to predict the game outcome from that position).
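
Here is a rough sketch of the shapes of those two networks, just to make the structure concrete; this is not the paper’s exact architecture (the input planes, filter counts, and layer sizes below are assumptions), and it’s written in PyTorch only for convenience. The move-selecting network maps a board to a probability per point, the value network maps a board to a single scalar.

    # rough sketch only -- the input planes, filter counts, and padding below are
    # assumptions, not the paper's exact architecture
    import torch
    import torch.nn as nn

    class PolicyNet(nn.Module):
        """13 conv layers; board in, probability per board point out."""
        def __init__(self, in_planes=48, filters=192):
            super().__init__()
            layers = [nn.Conv2d(in_planes, filters, 5, padding=2), nn.ReLU()]
            for _ in range(11):                      # middle 3x3 conv layers
                layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
            layers += [nn.Conv2d(filters, 1, 1)]     # final 1x1 conv: one logit per point
            self.body = nn.Sequential(*layers)

        def forward(self, board):                    # board: (batch, in_planes, 19, 19)
            logits = self.body(board).flatten(1)     # (batch, 361)
            return torch.softmax(logits, dim=1)      # probability distribution over moves

    class ValueNet(nn.Module):
        """Same conv trunk shape, but ending in a single scalar prediction."""
        def __init__(self, in_planes=48, filters=192):
            super().__init__()
            self.trunk = PolicyNet(in_planes, filters).body[:-1]   # drop the policy head
            self.head = nn.Sequential(
                nn.Conv2d(filters, 1, 1), nn.Flatten(),
                nn.Linear(19 * 19, 256), nn.ReLU(),
                nn.Linear(256, 1), nn.Tanh())        # scalar in [-1, 1]: predicted outcome

        def forward(self, board):
            return self.head(self.trunk(board))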

They also train ANOTHER move-predicting classifier, called the ‘fast rollout’ policy; the reason for another one is that the fast rollout policy is supposed to be very fast to run, unlike the neural nets. The fast rollout policy is a linear softmax of small pattern features (move matches one or more response patterns, move saves stone(s) from capture, move is 8-connected to the previous move, move matches nakade patterns at a captured stone, move matches a 12-point diamond pattern near the previous move, move matches a 3×3 pattern around the candidate move). When a feature is “move matches some pattern”, i don’t understand whether they mean that “matches any pattern” is a single feature, or whether each possible pattern is its own feature; i suspect the latter, even though that’s a zillion features to compute. The feature weights of the fast rollout classifier are trained on a database of expert games.
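
If the “each concrete pattern is its own binary feature” reading is right, the fast rollout policy is just a weighted sum of whichever features fire for each candidate move, pushed through a softmax. A minimal sketch (the feature extractor here is a hypothetical stand-in for the paper’s pattern matchers):

    # minimal sketch of a linear softmax over binary pattern features; which
    # patterns exist and how they are detected is abstracted into feature_fn,
    # a hypothetical stand-in
    import numpy as np

    def fast_rollout_probs(candidate_moves, feature_fn, weights):
        """feature_fn(move) -> indices of the binary features that fire for that move;
        weights -> one learned weight per feature (trained on expert games)."""
        scores = np.array([weights[list(feature_fn(m))].sum() for m in candidate_moves])
        probs = np.exp(scores - scores.max())        # softmax over the candidate moves
        return probs / probs.sum()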

Now they will use three of those classifiers: the ‘SL’ neural network (the saved network that tried to learn which move an expert would have made, before the further training against itself), the board-position-value-predicting network, and the ‘fast rollout’ policy.

The next part, the Monte Carlo Tree Search combined with the neural networks, is kinda complicated and i don’t fully understand it, so the following is likely to be wrong. The idea of Monte Carlo Tree Search is to estimate the value of a board position by simulating all or part of a game in which both players use, as their policy, a classifier without lookahead (i.e. within the simulation, neither player does any lookahead at each step); this simulation is (eventually) done many times and the results are averaged together. Such a simulation is called a ‘rollout’. Each time a simulation finishes, the policy is updated. In this application, the ‘fast rollout’ policy is used in some parts of the simulation but not all of it, because the ‘fast rollout’ policy is fixed after training, whereas the idea of Monte Carlo Tree Search is that the policy is updated after each simulation.
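
To make the word ‘rollout’ concrete, here is a hedged sketch of a single rollout: both players sample moves from a fast, no-lookahead policy until the game ends, and the final result is returned. The Game interface and fast_policy below are hypothetical stand-ins, not anything from the paper.

    # hedged sketch of one 'rollout'; Game and fast_policy are hypothetical
    # stand-ins, not anything from the paper
    import random

    def rollout(game, fast_policy):
        state = game.copy()
        while not state.is_over():
            moves = state.legal_moves()
            probs = fast_policy(state, moves)                      # no lookahead
            state.play(random.choices(moves, weights=probs, k=1)[0])
        return state.winner()     # e.g. +1 / -1 from the root player's point of view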

In order to take one turn in the real game, the program does zillions of iterations; in each iteration, it simulates a game-within-a-game:

It simulates a game in which the players use the current policy. The policy is represented as a tree of game states whose root is the current actual game state, whose edges are potential moves, and whose nodes or edges are labeled with the current policy’s estimated values for game states (plus a factor encouraging exploration of unexplored or underexplored board states).
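
A sketch of what that might look like: each tree node stores a prior (from the move-selecting network), a visit count, and a running action value, and the descent picks the child maximizing value plus an exploration bonus that is large for high-prior, rarely-visited moves. The constant c_puct and the exact form of the bonus are assumptions modeled on the paper’s description.

    # sketch of the per-node bookkeeping and the descent rule; c_puct and the
    # exact form of the exploration bonus are assumptions
    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        prior: float                  # the SL network's prior probability for the move
        visits: int = 0               # visit count
        q_value: float = 0.0          # running mean of backed-up values
        children: dict = field(default_factory=dict)   # move -> Node

    def select_child(node, c_puct=5.0):
        total = sum(c.visits for c in node.children.values())
        def score(item):
            _, child = item
            bonus = c_puct * child.prior * math.sqrt(total) / (1 + child.visits)
            return child.q_value + bonus              # value + exploration bonus
        return max(node.children.items(), key=score)  # -> (move, child)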

When the simulation has visited the parent of a ‘leaf node’ (a game state which has not yet been analyzed, but which is a child of a node that is not a leaf node) more than some threshold number of times, the leaf node is added to a queue for an asynchronous process to ‘expand the leaf node’ (analyze it); the visit-count-before-expansion threshold is adaptively adjusted to keep the queue short. This process estimates the value of the leaf node via a linear combination of (a) the board-position-value-predicting network’s output and (b) the outcome of running a simulation of the rest of the game (a game within a game within a game) with both players using the ‘fast rollout’ policy. Then the SL neural network is used to give initial estimates of the value of each move from that board position (you only have to run SL once to get an estimate for every possible move from that position, whereas recursing into each of the many possible successor positions and running the board-position-value-predicting network on each of them would take much longer).
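
The leaf evaluation itself is just a weighted mix of those two signals; a sketch (value_net and rollout_fn are hypothetical callables, and equal weighting is my reading of the paper):

    # sketch of the leaf evaluation; value_net and rollout_fn are hypothetical
    # callables, and lam=0.5 (equal weighting) is my reading of the paper
    def evaluate_leaf(leaf_state, value_net, rollout_fn, lam=0.5):
        v = value_net(leaf_state)     # (a) value network's predicted outcome
        z = rollout_fn(leaf_state)    # (b) result of a fast-rollout playout to the end
        return (1 - lam) * v + lam * z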

Because the expansion of a leaf node (including running SL) is asynchronous, in the meantime (until the node reaches the front of the analysis queue and is analyzed) the leaf node is provisionally expanded and a ‘tree policy’ is used to give a quick estimate of the value of each possible move from the leaf node’s board state. The tree policy is like the fast rollout policy but with a few more features (move allows stones to be captured, manhattan distance to the two previous moves, move matches a 12-point diamond pattern centered on the candidate move). The tree policy’s estimates are replaced when the node reaches the front of the queue and is fully analyzed.

At the end of each iteration, the action value and ‘visit count’ of every (non-leaf) node visited during the simulation are updated.
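
A sketch of that backup step, reusing the Node fields from the selection sketch above: each node on the visited path gets its visit count bumped and its action value nudged toward the mean of the values backed up through it.

    # sketch of the backup: bump each visited node's visit count and update its
    # action value as a running mean of the backed-up values
    def backup(path_of_nodes, leaf_value):
        for node in path_of_nodes:
            node.visits += 1
            node.q_value += (leaf_value - node.q_value) / node.visits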

At the end of all of these iterations, the program actually plays the move that had the maximum visit count in the Monte Carlo Tree Search (“this is less sensitive to outliers than maximizing action-value”).
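
So the final choice is just a max over visit counts at the root; as a sketch:

    # sketch: the move actually played is the root child with the most visits
    def choose_move(root):
        return max(root.children.items(), key=lambda kv: kv[1].visits)[0]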

some more details:

  • During Monte Carlo Tree Search, they also use a heuristic called ‘last good reply’, which is sorta similar to caching.
  • the move-predicting networks are for the most part just fed the board state as input, but they also get one computed feature: “the outcome of a ladder search”.
  • because Go is symmetric w/r/t rotations and reflections of the board, the move-predicting networks are wrapped in a procedure that either randomly selects one of those symmetries, or runs the network for all of them and averages the results (which of the two happens depends on whether or not the network is being used inside the Monte Carlo Tree Search); a sketch of the averaging version follows this list.
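
A sketch of the averaging version, assuming a hypothetical policy_net that maps a (19, 19, planes) board array to a (19, 19) array of move probabilities; the 8 symmetries are the 4 rotations, each with and without a reflection.

    # sketch of the averaging wrapper; policy_net is a hypothetical function from
    # a (19, 19, planes) board array to a (19, 19) array of move probabilities
    import numpy as np

    def averaged_policy(board, policy_net):
        total = np.zeros(board.shape[:2])
        for flip in (False, True):
            b = board[:, ::-1] if flip else board          # optional reflection
            for k in range(4):                             # 4 rotations
                probs = policy_net(np.rot90(b, k))
                probs = np.rot90(probs, -k)                # undo the rotation
                if flip:
                    probs = probs[:, ::-1]                 # undo the reflection
                total += probs
        return total / 8.0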

 

Links: