The main reason AlphaGo Zero learns so much faster than its predecessors is that it uses temporal-difference learning.[1] This effectively removes a huge amount of the value network's state space for the learning algorithm to search through, since it bakes in the assumption that a move's value ought to equal that of the best available move in the following board position, which is exactly what you'd expect for a game like Go.
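For concreteness, a minimal sketch of the kind of temporal-difference backup being described (the dict-based value table, the negamax-style target, and the learning rate are illustrative assumptions, not AlphaGo Zero's actual training rule; see the correction further down the thread):

```python
def td_backup(value, state, next_states, alpha=0.1):
    """One temporal-difference backup for a two-player, zero-sum board game.

    The target for the current position is the value of the best available
    move, i.e. the negation of the opponent's best value in the following
    position (negamax convention), so the value estimate only has to stay
    locally consistent between successive positions.
    """
    # value: dict mapping position -> estimated value in [-1, 1]
    # next_states: positions reachable in one legal move
    best_next = max(-value.get(s, 0.0) for s in next_states)
    td_error = best_next - value.get(state, 0.0)
    value[state] = value.get(state, 0.0) + alpha * td_error
    return td_error
```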
A secondary reason for AlphaGo Zero's performance is that it combines both value and policy networks into a single network, since it's redundant to have two networks for move selection.
These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.
Interestingly, the idea behind temporal difference learning is more or less the intuition behind how people price derivatives in finance.
The expected value of a contract at time T, estimated at some time t < T, is assumed to be equal (up to discounting) for all t -- e.g. if today we think the contract will be worth $100 a year later, then we also think that the expected estimate, made n months from now, of the value [12-n] months later, will also be $100. This allows you to shrink the state space considerably.
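In symbols, that assumption is just the statement that the discounted contract value is a martingale under the pricing measure (standard risk-neutral notation, not taken from the thread):

```latex
% The time-t estimate of the contract's time-T value is the same for
% every t < T, up to discounting at the rate r:
V_t \;=\; e^{-r(T-t)}\, \mathbb{E}^{\mathbb{Q}}\!\left[ V_T \mid \mathcal{F}_t \right],
\qquad\text{equivalently}\qquad
\mathbb{E}^{\mathbb{Q}}\!\left[ e^{-rT} V_T \mid \mathcal{F}_t \right] \;=\; e^{-rt} V_t
\quad \text{for all } t \le T .
```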
You can usually work out the payoff of a derivative in different scenarios given rational exercise decisions by all contract participants. The calculation assumes that every market participant makes the best possible decision given the information available at the time; you either explicitly or implicitly build a tree and work backwards, back-propagating the 'future' value to the root.
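A minimal sketch of that build-a-tree-and-work-backwards calculation, here for an American put on a binomial tree (the put payoff and all parameters are illustrative assumptions):

```python
import math

def american_put_binomial(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, steps=100):
    """Price an American put by backward induction on a binomial tree.

    At each node the holder makes the best possible decision: exercise now,
    or keep the option and take its discounted expected continuation value.
    That 'future' value is propagated back toward the root, much like
    backing up a game tree.
    """
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))       # up factor per step
    d = 1.0 / u                               # down factor per step
    p = (math.exp(r * dt) - d) / (u - d)      # risk-neutral up probability
    disc = math.exp(-r * dt)

    # Terminal payoffs at the leaves of the tree (j = number of up moves).
    values = [max(K - S0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]

    # Back-propagate toward the root, choosing the better of exercising
    # or continuing at every interior node.
    for i in range(steps - 1, -1, -1):
        for j in range(i + 1):
            cont = disc * (p * values[j + 1] + (1 - p) * values[j])
            exercise = max(K - S0 * u**j * d**(i - j), 0.0)
            values[j] = max(cont, exercise)
    return values[0]

print(round(american_put_binomial(), 2))
```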
This closely resembles the modeling of a discrete adversarial game, except the payoffs need to make reference to random market variables like the stock price, so the tree nodes are not just indexed by participant action, but also by variables.
There's actually a nice resemblance between the Longstaff-Schwartz method of pricing American options and MCTS + AlphaGo, except that the former uses kernel regressions instead of deep neural nets and samples from a continuous space with an assumed probability distribution instead of a discrete space guided by a policy network [1].
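For the curious, a stripped-down sketch of the Longstaff-Schwartz idea for the same American put (a plain polynomial least-squares regression stands in for their choice of basis functions, and all names and parameters are illustrative assumptions):

```python
import numpy as np

def lsm_american_put(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                     steps=50, paths=100_000, seed=0):
    """Longstaff-Schwartz: Monte Carlo paths plus a regression for the
    continuation value, instead of an explicit tree.

    At each exercise date, regress discounted future cashflows on the current
    stock price (in-the-money paths only) to estimate the value of continuing,
    then exercise wherever the immediate payoff beats that estimate.
    """
    rng = np.random.default_rng(seed)
    dt = T / steps
    # Simulate geometric Brownian motion paths under the risk-neutral measure.
    z = rng.standard_normal((paths, steps))
    log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
    S = S0 * np.exp(log_paths)                  # S[:, t] is the price at step t+1

    cashflow = np.maximum(K - S[:, -1], 0.0)    # exercise value at maturity
    for t in range(steps - 2, -1, -1):
        cashflow *= np.exp(-r * dt)             # discount one step back
        itm = K - S[:, t] > 0                   # regress only where exercise is possible
        if itm.sum() > 0:
            coeffs = np.polyfit(S[itm, t], cashflow[itm], deg=2)
            continuation = np.polyval(coeffs, S[itm, t])
            exercise = K - S[itm, t]
            cashflow[itm] = np.where(exercise > continuation, exercise, cashflow[itm])
    return np.exp(-r * dt) * cashflow.mean()

print(round(lsm_american_put(), 2))
```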
Or in the parlance of probability theory: "The expectation of the posterior probability, after viewing the evidence, must equal the prior probability."
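Written out, that is the law of total expectation applied to the posterior (sometimes called conservation of expected evidence):

```latex
% Averaging the posterior over all possible evidence recovers the prior:
\mathbb{E}_{E}\!\left[ P(H \mid E) \right]
  \;=\; \sum_{e} P(E = e)\, P(H \mid E = e)
  \;=\; P(H).
```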
A lot of people reading the paper miss this. I guess it's not emphasized enough.
In the first paper, the self-play-trained policy is about 1500 in Elo rating, while darkforest2, a supervised-trained policy from Facebook, is around the same, if not better. So self-play wasn't of much use the first time around. In the AlphaZero paper, by contrast, the self-play-trained policy has an Elo rating of about 3000.
Sidenote: It used to be that simply googling "sutton barto book" would bring you to the right place with the first suggested link. Now this stuff is so popular all of a sudden, I needed to consult the link I had set on my own page in order to find it. It's curious how the growth of popularity of an idea will with time obscure its own roots and primary sources. On the plus side, TIL that Sutton's working on a 2nd edition! =)
Are you sure AlphaGo Zero even uses temporal difference learning? The Nature paper suggests it does not and merely references some older programs that did. I think it just uses a somewhat custom form of self-play reinforcement learning combined with MCTS.
You are correct. There is no TD learning in AGZ. The value network is trained to directly predict the game outcome given the current state, and is not trained through "bootstrapping" based on the next state's value estimate.
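Roughly what that training target looks like, as a sketch (the combined value/policy loss follows the form given in the AlphaGo Zero paper, but the function and variable names here are made up, and it assumes a two-headed policy/value network):

```python
import torch.nn.functional as F

def agz_training_loss(net, states, search_policies, outcomes, c=1e-4):
    """AlphaGo Zero-style loss (sketch, PyTorch).

    The value head is regressed directly onto the final game outcome z
    (+1/-1 from the current player's perspective) rather than onto the next
    state's value estimate, so there is no TD-style bootstrapping.
    The policy head is trained toward the MCTS visit distribution pi.
    """
    policy_logits, value = net(states)
    value_loss = F.mse_loss(value.squeeze(-1), outcomes)
    policy_loss = -(search_policies * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    l2 = sum((p ** 2).sum() for p in net.parameters())   # weight regularisation
    return value_loss + policy_loss + c * l2
```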
Interestingly, a TD algorithm developed by a Canadian AI researcher now working with DeepMind was used in the early 1990s to beat expert players at backgammon, and it advanced human understanding of the game:
> TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.
> TD-Gammon achieved a level of play just slightly below that of the top human backgammon players of the time. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.
TD-Gammon was taught to us as a part of a classroom course on Reinforcement Learning (RL) in 2007. ML was known to a small set of people back then, there weren't many jobs in the area (this is in India), and even to many in this set, RL was either not known or not well known. It's interesting to see RL surge in popularity. In fact just a couple of weeks back, I was talking to the professor who taught us that course, and it was fun comparing ML/RL related awareness then to now :-)
Yes, me too. TD was also pretty much considered useless, until it was used for backgammon. And like many things in the AI/ML world, no one really knew exactly why it worked so well.
Backgammon is also interesting in that there is a non-deterministic element - the dice roll on every turn. This is where TD seems to shine.
OP explicitly discusses the second big distinguishing characteristic in the opening paragraph of the section titled 'The Alpha Zero Neural Net':
“The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy π and the approximate value function Ŵ are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network f that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)”
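A toy version of that single two-headed network, just to make its shape concrete (the trunk here is a couple of plain conv layers; the real network is a much deeper residual CNN, so treat the sizes as placeholder assumptions):

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """One network, two heads, as in the quoted paragraph (a sketch).

    Input: 17 board planes (the previous eight game states for each player
    plus a whose-turn indicator). Output: move logits and a scalar value.
    """

    def __init__(self, board_planes=17, board_size=19, channels=64, moves=19 * 19 + 1):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(board_planes, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
        )
        trunk_out = channels * board_size * board_size
        self.policy_head = nn.Linear(trunk_out, moves)                        # logits over moves
        self.value_head = nn.Sequential(nn.Linear(trunk_out, 1), nn.Tanh())   # value in [-1, 1]

    def forward(self, state):
        h = self.trunk(state)
        return self.policy_head(h), self.value_head(h)
```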
Regarding whether OP touches on temporal-difference learning, I am unqualified to say; they do not explicitly mention it. Furthermore, I am unqualified to judge how central this technique is to the level of play achieved. However, the DeepMind paper (pg. 20) discusses temporal-difference learning thus:
“Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo[40, 41] used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning[42] to predict territory in games of self-play, building on prior work[43]. A related approach, RLGO[44], represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.”
I'm no expert but this implies to me that it was probably the sum total of all the subtle architectural decisions made by the DeepMind team plus their AI hardware and software platform that made AlphaGo Zero excel.
Of course, the techniques you have mentioned were known and tried before. The big thing in AlphaGo Zero is that they found a way to make them work, and the resulting architecture even manages to look simple.
(E.g. AlphaGo using two different networks for policy and valuation was a big breakthrough back when it was new, because people couldn't make a single combined network work as well.)
For temporal difference learning the article about TD-Gammon, a backgammon AI from the early 90s, is great: http://www.bkgm.com/articles/tesauro/tdl.html (It's linked from the Wikipedia article you referenced, too.)
Could you elaborate on this point? What you're saying sounds like dynamic programming, which does not reduce the state space at all; it just saves on redundant computations (and is a favourite of programming interviews everywhere).
Training via bootstrapping (i.e. dynamic programming) does reduce the state space for search when working with function approximation. It represents a bias in what sorts of values the value function approximator should predict for each state. It encodes a sort of "local continuity/coherence" constraint that wouldn't necessarily be induced by simply training to predict the raw values -- collected stochastically via interaction with the environment. This local coherence constraint acts as a regularizer (i.e. bias) while training the value function approximator.
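To make the contrast concrete, here is a sketch of the two kinds of training target (the function names and the undiscounted return are illustrative assumptions):

```python
def monte_carlo_targets(trajectory):
    """Targets for V(s): the raw observed return from each state onward
    (no bootstrapping). Unbiased but high-variance, and nothing ties the
    predictions at neighbouring states together."""
    # trajectory: list of (state, reward) pairs from one episode, undiscounted
    targets, g = [], 0.0
    for state, reward in reversed(trajectory):
        g += reward
        targets.append((state, g))
    return list(reversed(targets))

def bootstrapped_target(value_fn, reward, next_state, gamma=1.0):
    """Target for V(s) of the form r + gamma * V(s'): the approximator's own
    prediction at the next state is folded into the target, which is the
    'local coherence' constraint acting as a regularizer."""
    return reward + gamma * value_fn(next_state)
```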
[1] https://en.wikipedia.org/wiki/Temporal_difference_learning