The main reason AlphaGo Zero learns so much faster than its predecessors is that it uses temporal-difference learning.[1] This effectively removes a huge amount of the value network's state space for the learning algorithm to search through, since it bakes in the assumption that a move's value ought to equal that of the best available move in the following board position, which is exactly what you'd expect for a game like Go.
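For concreteness, a minimal sketch of the kind of temporal-difference backup being described (the dict-based value table, the negamax-style target, and the learning rate are illustrative assumptions, not AlphaGo Zero's actual training rule; see the correction further down the thread):

```python
def td_backup(value, state, next_states, alpha=0.1):
    """One temporal-difference backup for a two-player, zero-sum board game.

    The target for the current position is the value of the best available
    move, i.e. the negation of the opponent's best value in the following
    position (negamax convention), so the value estimate only has to stay
    locally consistent between successive positions.
    """
    # value: dict mapping position -> estimated value in [-1, 1]
    # next_states: positions reachable in one legal move
    best_next = max(-value.get(s, 0.0) for s in next_states)
    td_error = best_next - value.get(state, 0.0)
    value[state] = value.get(state, 0.0) + alpha * td_error
    return td_error
```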
A secondary reason for AlphaGo Zero's performance is that it combines both value and policy networks into a single network, since it's redundant to have two networks for move selection.
These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.
Interestingly, the idea behind temporal difference learning is more or less the intuition behind how people price derivatives in finance.
The expected value of a contract at time T, estimated at some time t < T, is assumed to be equal (up to discounting) for all t -- e.g. if today we think the contract will be worth $100 a year later, then we also think that the expected estimate, made n months from now, of the value [12-n] months later, will also be $100. This allows you to shrink the state space considerably.
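In symbols, that assumption is just the statement that the discounted contract value is a martingale under the pricing measure (standard risk-neutral notation, not taken from the thread):

```latex
% The time-t estimate of the contract's time-T value is the same for
% every t < T, up to discounting at the rate r:
V_t \;=\; e^{-r(T-t)}\, \mathbb{E}^{\mathbb{Q}}\!\left[ V_T \mid \mathcal{F}_t \right],
\qquad\text{equivalently}\qquad
\mathbb{E}^{\mathbb{Q}}\!\left[ e^{-rT} V_T \mid \mathcal{F}_t \right] \;=\; e^{-rt} V_t
\quad \text{for all } t \le T .
```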
You can usually work out the payoff of a derivative in different scenarios given rational exercise decisions by all contract participants. The calculation assumes that every market participant makes the best possible decision given the information available at the time; you either explicitly or implicitly build a tree and work backwards, back-propagating the 'future' value to the root.
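A minimal sketch of that build-a-tree-and-work-backwards calculation, here for an American put on a binomial tree (the put payoff and all parameters are illustrative assumptions):

```python
import math

def american_put_binomial(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, steps=100):
    """Price an American put by backward induction on a binomial tree.

    At each node the holder makes the best possible decision: exercise now,
    or keep the option and take its discounted expected continuation value.
    That 'future' value is propagated back toward the root, much like
    backing up a game tree.
    """
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))       # up factor per step
    d = 1.0 / u                               # down factor per step
    p = (math.exp(r * dt) - d) / (u - d)      # risk-neutral up probability
    disc = math.exp(-r * dt)

    # Terminal payoffs at the leaves of the tree (j = number of up moves).
    values = [max(K - S0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]

    # Back-propagate toward the root, choosing the better of exercising
    # or continuing at every interior node.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            cont = disc * (p * values[j + 1] + (1 - p) * values[j])
            exercise = max(K - S0 * u**j * d**(i - j), 0.0)
            values[j] = max(cont, exercise)
    return values[0]

print(round(american_put_binomial(), 2))
```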
This closely resembles the modeling of a discrete adversarial game, except the payoffs need to make reference to random market variables like the stock price, so the tree nodes are not just indexed by participant action, but also by variables.
There's actually a nice resemblance between the Longstaff-Schwartz method of pricing American options and MCTS + AlphaGo, except that the former uses kernel regressions instead of deep neural nets and samples from a continuous space with an assumed probability distribution instead of a discrete space guided by a policy network [1].
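For the curious, a stripped-down sketch of the Longstaff-Schwartz idea for the same American put (a plain polynomial least-squares regression stands in for their choice of basis functions, and all names and parameters are illustrative assumptions):

```python
import numpy as np

def lsm_american_put(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                     steps=50, paths=100_000, seed=0):
    """Longstaff-Schwartz: Monte Carlo paths plus a regression for the
    continuation value, instead of an explicit tree.

    At each exercise date, regress discounted future cashflows on the current
    stock price (in-the-money paths only) to estimate the value of continuing,
    then exercise wherever the immediate payoff beats that estimate.
    """
    rng = np.random.default_rng(seed)
    dt = T / steps
    # Simulate geometric Brownian motion paths under the risk-neutral measure.
    z = rng.standard_normal((paths, steps))
    log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
    S = S0 * np.exp(log_paths)                  # S[:, t] is the price at step t+1

    cashflow = np.maximum(K - S[:, -1], 0.0)    # exercise value at maturity
    for t in range(steps - 2, -1, -1):
        cashflow *= np.exp(-r * dt)             # discount one step back
        itm = K - S[:, t] > 0                   # regress only where exercise is possible
        if itm.sum() > 0:
            coeffs = np.polyfit(S[itm, t], cashflow[itm], deg=2)
            continuation = np.polyval(coeffs, S[itm, t])
            exercise = K - S[itm, t]
            cashflow[itm] = np.where(exercise > continuation, exercise, cashflow[itm])
    return np.exp(-r * dt) * cashflow.mean()

print(round(lsm_american_put(), 2))
```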
Or in the parlance of probability theory: "The expectation of the posterior probability, after viewing the evidence, must equal the prior probability."
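Written out, that is the law of total expectation applied to the posterior (sometimes called conservation of expected evidence):

```latex
% Averaging the posterior over all possible evidence recovers the prior:
\mathbb{E}_{E}\!\left[ P(H \mid E) \right]
  \;=\; \sum_{e} P(E = e)\, P(H \mid E = e)
  \;=\; P(H).
```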
A lot of people reading the paper miss this. I guess it's not emphasized enough.
In the first paper, the self-play-trained policy is about 1500 in Elo rating, while darkforest2, a supervised-trained policy from Facebook, is around the same, if not better. So self-play wasn't of much use the first time around. In the AlphaZero paper, by contrast, the self-play-trained policy has an Elo rating of about 3000.
Sidenote: It used to be that simply googling "sutton barto book" would bring you to the right place with the first suggested link. Now this stuff is so popular all of a sudden, I needed to consult the link I had set on my own page in order to find it. It's curious how the growth of popularity of an idea will with time obscure its own roots and primary sources. On the plus side, TIL that Sutton's working on a 2nd edition! =)
Are you sure AlphaGo Zero even uses temporal difference learning? The Nature paper suggests it does not and merely references some older programs that did. I think it just uses a somewhat custom form of self-play reinforcement learning combined with MCTS.
You are correct. There is no TD learning in AGZ. The value network is trained to directly predict the game outcome given the current state, and is not trained through "bootstrapping" based on the next state's value estimate.
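Roughly what that training target looks like, as a sketch (the combined value/policy loss follows the form given in the AlphaGo Zero paper, but the function and variable names here are made up, and it assumes a two-headed policy/value network):

```python
import torch.nn.functional as F

def agz_training_loss(net, states, search_policies, outcomes, c=1e-4):
    """AlphaGo Zero-style loss (sketch, PyTorch).

    The value head is regressed directly onto the final game outcome z
    (+1/-1 from the current player's perspective) rather than onto the next
    state's value estimate, so there is no TD-style bootstrapping.
    The policy head is trained toward the MCTS visit distribution pi.
    """
    policy_logits, value = net(states)
    value_loss = F.mse_loss(value.squeeze(-1), outcomes)
    policy_loss = -(search_policies * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    l2 = sum((p ** 2).sum() for p in net.parameters())   # weight regularisation
    return value_loss + policy_loss + c * l2
```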
Interestingly, a TD algorithm developed by a Canadian AI researcher now working with DeepMind was used in the early 1990s to beat expert players at backgammon, and it advanced human understanding of the game:
> TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.
> TD-Gammon achieved a level of play just slightly below that of the top human backgammon players of the time. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.
TD-Gammon was taught to us as a part of a classroom course on Reinforcement Learning (RL) in 2007. ML was known to a small set of people back then, there weren't many jobs in the area (this is in India), and even to many in this set, RL was either not known or not well known. It's interesting to see RL surge in popularity. In fact just a couple of weeks back, I was talking to the professor who taught us that course, and it was fun comparing ML/RL related awareness then to now :-)
Yes, me too. TD was also pretty much considered useless, until it was used for backgammon. And like many things in the AI/ML world, no one really knew exactly why it worked so well.
Backgammon is also interesting in that there is a non-deterministic element - the dice roll on every turn. This is where TD seems to shine.
OP explicitly discusses the second big distinguishing characteristic in the opening paragraph of the section titled 'The Alpha Zero Neural Net':
“The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy π and the approximate value function Ŵ are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network f that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)”
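A toy version of that single two-headed network, just to make its shape concrete (the trunk here is a couple of plain conv layers; the real network is a much deeper residual CNN, so treat the sizes as placeholder assumptions):

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """One network, two heads, as in the quoted paragraph (a sketch).

    Input: 17 board planes (the previous eight game states for each player
    plus a whose-turn indicator). Output: move logits and a scalar value.
    """

    def __init__(self, board_planes=17, board_size=19, channels=64, moves=19 * 19 + 1):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(board_planes, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        trunk_out = channels * board_size * board_size
        self.policy_head = nn.Linear(trunk_out, moves)                        # logits over moves
        self.value_head = nn.Sequential(nn.Linear(trunk_out, 1), nn.Tanh())   # value in [-1, 1]

    def forward(self, state):
        h = self.trunk(state)
        return self.policy_head(h), self.value_head(h)
```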
Regarding whether OP touches on temporal-difference learning, I am unqualified to say; they do not explicitly mention it. Furthermore, I am unqualified to judge how central this technique is to the level of play achieved. However, the DeepMind paper (pg. 20) discusses temporal-difference learning thus:
“Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo[40, 41] used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning[42] to predict territory in games of self-play, building on prior work[43]. A related approach, RLGO[44], represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.”
I'm no expert but this implies to me that it was probably the sum total of all the subtle architectural decisions made by the DeepMind team plus their AI hardware and software platform that made AlphaGo Zero excel.
Of course, the techniques you have mentioned were known and tried before. The big thing in AlphaGo Zero is that they found a way to make them work, and the resulting architecture even manages to look simple.
(E.g. AlphaGo using two different networks for policy and valuation was a big breakthrough back when it was new, because people couldn't make a single combined network work as well.)
For temporal difference learning the article about TD-Gammon, a backgammon AI from the early 90s, is great: http://www.bkgm.com/articles/tesauro/tdl.html (It's linked from the Wikipedia article you referenced, too.)
Could you elaborate on this point? What you're saying sounds like dynamic programming, which does not reduce the state space at all; it just saves on redundant computations (and is a favourite of programming interviews everywhere).
Training via bootstrapping (i.e. dynamic programming) does reduce the state space for search when working with function approximation. It represents a bias in what sorts of values the value function approximator should predict for each state. It encodes a sort of "local continuity/coherence" constraint that wouldn't necessarily be induced by simply training to predict the raw values -- collected stochastically via interaction with the environment. This local coherence constraint acts as a regularizer (i.e. bias) while training the value function approximator.
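To make the contrast concrete, here is a sketch of the two kinds of training target (the function names and the undiscounted return are illustrative assumptions):

```python
def monte_carlo_targets(trajectory):
    """Targets for V(s): the raw observed return from each state onward
    (no bootstrapping). Unbiased but high-variance, and nothing ties the
    predictions at neighbouring states together."""
    # trajectory: list of (state, reward) pairs from one episode, undiscounted
    targets, g = [], 0.0
    for state, reward in reversed(trajectory):
        g += reward
        targets.append((state, g))
    return list(reversed(targets))

def bootstrapped_target(value_fn, reward, next_state, gamma=1.0):
    """Target for V(s) of the form r + gamma * V(s'): the approximator's own
    prediction at the next state is folded into the target, which is the
    'local coherence' constraint acting as a regularizer."""
    return reward + gamma * value_fn(next_state)
```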
[1] https://en.wikipedia.org/wiki/Temporal_difference_learning