Comparing against Stockfish 8 in a paper released today and labeling it simply "Stockfish" borders on dishonest. The current Stockfish version (14) would make AlphaZero look bad, so they don't include it ...
The name of the game here is generality. For a really general agent, they are aiming for superhuman performance, not state of the art on every individual task. Beating Stockfish 8 convinces me that it would be superhuman at chess.
They could still be honest that it's Stockfish 8, not the Stockfish everyone has. Your product having genuine value does not excuse lying about that value.
I've observed this kind of behavior in many papers nowadays. It's extremely painful for research, because better candidates can be overlooked while FAANG publishes the majority of ML papers. It's a mess.
> one of the strongest and most widely-used programs is Stockfish [81].
Here's the citation, note the date:
> [81] The Stockfish Development Team. Stockfish: Open source chess engine, 2021. https://stockfishchess.org/.
They mention the version number only once, further down, and don't point out that it has been out of date since February 2018. The other 11 mentions omit the version number, as in this sentence:
> In Chess, PoG(60000,10) is stronger than Stockfish using 4 threads and one second of search time.
Yeah, so it is! I guess I ran into the same weirdness as ShamelessC: when I first Ctrl-F'd the PDF, hit 1/11 was on page 11. Now that I try my damnedest to reproduce it, I get 12 hits and the first is that one on page 10.
The first mention says "Stockfish 8, level 20" in the paper. This isn't a blog post you can skim; you need to read the whole thing before critiquing.
That's actually the second mention, the first is when they introduce the games in section 4:
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
They also go back to referring to it as Stockfish for the rest of the paper.
An analogous situation in my mind would be if AMD released a new CPU and benchmarked it against an Intel CPU, only mentioning once, somewhere in the middle of the paper, that it was a Pentium 4.
This sort of evasiveness about method limitations, downplaying or de-emphasizing related work while boosting the senior authors' previous work, is standard academic fare. It's partly a strategy against novelty nitpickers, and it results in a net negative for everyone.
I also suspect part of the reason they chose Stockfish 8 was as a basis of comparison with AlphaZero. Their baselines for Go and poker are also pretty weak, so their emphasis is clearly on demonstrating generality and reduced domain-specialized input, not supremacy.
A single algorithm that plays both perfect- and imperfect-information games is difficult to achieve. Standard depth-limited solvers and self-play RL result in highly exploitable agents. PoG appears to be very strong at chess, decently strong at Go, and decent at poker (Facebook AI's ReBeL, the strongest prior work in this area, performed better against Slumbot). What's unique about PoG is its ability to also play an imperfect-information game (Scotland Yard) that has many rounds and a relatively long horizon (although it still has scaling issues).
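To make "highly exploitable" concrete: exploitability measures how much a best response earns against a frozen strategy. Here's a toy sketch in rock-paper-scissors (my own illustration, not anything from the paper):

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player:
# rows/cols = (rock, paper, scissors)
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response_value(opponent_strategy):
    """Expected payoff of the best pure response against a frozen
    opponent mix; 0 means unexploitable, positive means exploitable."""
    return float(max(A @ opponent_strategy))

print(best_response_value(np.array([1/3, 1/3, 1/3])))  # ~0.0: uniform is the equilibrium
print(best_response_value(np.array([0.5, 0.3, 0.2])))  # 0.3: always play paper
```

A self-play agent whose learned mix drifts off uniform is exactly the second case: fine against itself, losing steadily to a best responder.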
It really isn't, though. Technical papers have conventions, and they follow them reasonably here. You expect the methods description to be specific, the abstract not to be hyperbolic, and the conclusions to be balanced. The general discussion parts are just that: general.
In the methods section they discuss the exact versions and parameters used, and how they compared them.
In the conclusions:
> In the perfect information games of chess and Go, PoG performs at the level of human experts or professionals, but can be significantly weaker than specialized algorithms for this class of games, like AlphaZero, when given the same resources.
It would perhaps have been interesting to include a more recent Stockfish, but it wouldn't really change the paper's conclusions.
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
This is just a general effort to describe the present state of things. When they explicitly describe their evaluation process, they are careful to give the version number. They then _immediately_ drop it in subsequent usage, which is standard practice in research papers so that they don't belabor the minute details of every single thing they find themselves redescribing. Believe me, you don't want to read the verbose version of this paragraph.
> In chess, we evaluated PoG against Stockfish 8, level 20 [81] and AlphaZero. PoG(800, 1) was run in training for 3M training steps. During evaluation, Stockfish uses various search controls: number of threads, and time per search. We evaluate AlphaZero and PoG up to 60000 simulations. A tournament between all of the agents was played at 200 games per pair of agents (100 games as white, 100 games as black). Table 1a shows the relative Elo comparison obtained by this tournament, where a baseline of 0 is chosen for Stockfish(threads=1, time=0.1s).
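For reference on how those Table 1a numbers work: a relative Elo difference follows directly from the head-to-head score fraction under the standard logistic model. A minimal sketch (mine, not the paper's code):

```python
import math

def relative_elo(score_fraction: float) -> float:
    """Elo difference implied by a head-to-head score fraction
    ((wins + 0.5 * draws) / games) under the standard logistic model."""
    return 400 * math.log10(score_fraction / (1 - score_fraction))

# e.g. scoring 150/200 against the Stockfish baseline -> about +191 Elo
print(round(relative_elo(150 / 200)))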
I'd be interested to see that benchmark. A ~3 GHz Pentium 4 sounds like a good reference point for single-threaded performance, since it's a reasonably modern OoO microarchitecture and reflects the moment that clock scaling stopped.
With a smaller cache, a less efficient branch predictor, and only SSE for SIMD, I'd be curious to see that benchmark too, but I'd be surprised if it were close.
I don't know whether the much lower RAM bandwidth would have an impact on a CPU benchmark, though.
I obviously read it, otherwise I wouldn't have known which version they are using. They are banking on readers who just skim the figures and tables not noticing their use of outdated baselines.
Isn't the point to compare traditional heuristic techniques against DNN-learned techniques? I understand the latest Stockfish is inching quite close to AlphaZero's techniques, but maybe I am wrong.
It does have the option to use a neural network (NNUE) in its evaluation, but it is very different from what AlphaZero/Lc0 do. You can choose not to use it, so you could still have a "traditional" evaluation (which would still blow Stockfish 8 out of the water). Also, Stockfish 8 isn't even the last version before they merged NNUE ...
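For the curious, that switch is just a UCI option. A minimal sketch using python-chess (the engine path is a placeholder, and the "Use NNUE" option assumes a build that still ships the classical eval alongside NNUE, roughly Stockfish 12 through 15):

```python
import chess
import chess.engine

# Placeholder path; point it at a Stockfish build that still has
# the classical evaluation alongside NNUE.
engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
engine.configure({"Use NNUE": False})  # fall back to the hand-crafted eval

board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=1.0))  # 1 s/move, as in the paper
print(result.move)
engine.quit()
```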
The abstract clearly states that the best chess and Go bots are not beaten: "Player of Games reaches strong performance in chess and Go, beats the strongest openly available agent in heads-up no-limit Texas hold’em poker (Slumbot)..."
The problem with poker is that there is money to be made from having a strong AI, so there is zero incentive to release one. What's publicly available are solvers (which solve game abstractions similar to the full game but don't play themselves) and shitty bots.
The abstract claims they beat the "strongest openly available agent in heads-up no-limit Texas hold'em poker". To a non-expert, that certainly sounds like they're claiming to be the best.
As noted before, the reason for including old tech is to look better. Why not mention the current state of the art and show that a general player can come close to those results?
This is just benchmark cherry-picking and doesn't reflect real performance or a fair comparison.