The first mention says "Stockfish 8, level 20" in the paper. This isn't a blog post that you can skim; you need to read the whole thing before critiquing.
That's actually the second mention; the first is when they introduce the games in section 4:
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
They also go back to referring to it as Stockfish for the rest of the paper.
An analogous situation in my mind would be if AMD released a new CPU and benchmarked it against an Intel CPU, only mentioning once, somewhere in the middle of the paper, that it was a Pentium 4.
This sort of evasiveness about method limitations, downplaying or de-emphasizing related work while boosting the senior authors' previous work, is standard academic fare. It's partly a strategy against novelty nitpickers, and it results in a net negative for all.
I also suspect part of the reason they chose Stockfish 8 was as a basis of comparison with AlphaZero. Their baselines for Go and poker are also pretty weak, so their emphasis is clearly on displaying generality and reduced domain-specialized input, not supremacy.
A single algorithm to play perfect and imperfect information games is difficult to achieve. Standard depth-limited solvers and self-play RL result in highly exploitable agents. PoG appears to be very strong at chess, decently strong at Go and decent at poker (Facebook AI's ReBeL, the strongest prior work in this area, performed better against Slumbot). What's unique about PoG is its ability to also play an imperfect information game (Scotland Yard) that has many rounds and a relatively long horizon (although it still has scaling issues).
It really isn't though. Technical papers have conventions, and they follow them reasonably. You expect the methods description to be specific, the abstract not to be hyperbolic, and the conclusions to be balanced. The general discussion parts are just that, general.
In the methods area they discuss the exact versions and parameters used, and how they compared them.
In the conclusions:
> In the perfect information games of chess and Go, PoG performs at the level of human experts or professionals, but can be significantly weaker than specialized algorithms for this class of games, like AlphaZero, when given the same resources.
It would perhaps have been interesting to include a more recent Stockfish, but it wouldn't really impact the paper.
> Today, computer-playing programs remain consistently super-human, and one of the strongest and most widely-used programs is Stockfish.
This is just a general effort to describe the present state of things. When they explicitly describe their evaluation process, they are sure to use the version number. They then _immediately_ drop the version number in subsequent usage, which is culturally standard in research papers, so that they don't have to restate minute details every time they mention something again. Believe me, you don't want to read the verbose version of this paragraph.
> In chess, we evaluated PoG against Stockfish 8, level 20 [81] and AlphaZero. PoG(800, 1) was run in training for 3M training steps. During evaluation, Stockfish uses various search controls: number of threads, and time per search. We evaluate AlphaZero and PoG up to 60000 simulations. A tournament between all of the agents was played at 200 games per pair of agents (100 games as white, 100 games as black). Table 1a shows the relative Elo comparison obtained by this tournament, where a baseline of 0 is chosen for Stockfish(threads=1, time=0.1s).
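For anyone wondering how to read a "relative Elo" table like that, here is a rough sketch of how a head-to-head result maps to an Elo gap. It assumes the standard logistic Elo model, which may not be the exact rating procedure the paper used, and the function names and the 70/100/30 result are made up for illustration.

```python
import math


def match_score(wins: int, draws: int, losses: int) -> float:
    """Average score per game, counting a draw as half a point."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("need at least one game")
    return (wins + 0.5 * draws) / games


def elo_diff(score: float) -> float:
    """Elo gap implied by an average score under the logistic Elo model.

    Positive means the scoring side is stronger than its opponent.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


# Hypothetical 200-game pairing (not a result from the paper).
score = match_score(wins=70, draws=100, losses=30)
print(f"score={score:.3f}, implied Elo gap={elo_diff(score):.1f}")
```

Pinning one agent to 0, as the table does with Stockfish(threads=1, time=0.1s), just fixes the origin; only the differences between agents carry information.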
I'd be interested to see that benchmark. A ~3 GHz Pentium 4 sounds like a good reference point for single-threaded performance, since it's a reasonably modern OoO microarchitecture and reflects the moment that clock scaling stopped.
With a smaller cache, a less efficient branch predictor and only SSE for SIMD, I'd be curious to see the benchmark too, but I'd be surprised if it was close.
I don't know if the much lower RAM bandwidth would have an impact on a CPU benchmark, though.
I obviously read it, otherwise I wouldn't have known which version they are using. They are banking on others, the ones who do just skim the figures and tables, not noticing their use of outdated baselines.