Basically, we use a two-phase test to make the most of our testing resources: first a short time control test (15s/game) with more lenient SPRT termination criteria, then a long time control test (60s/game) with more stringent criteria. That, combined with setting the SPRT bounds so that we can measure 2-3 Elo improvements, has meant that Stockfish's progress consists almost entirely of improvements. Previously, when developing an engine, you'd make 10 changes, and if you were lucky, 2 or 3 would be good enough to make up for the other bad or neutral ones.
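To make the two-phase idea concrete, here is a rough sketch of an SPRT stopping rule in Python. It is not fishtest's actual code: it uses a normal approximation of the log-likelihood ratio, and the function names, the alpha/beta error rates, and the Elo bounds in STC and LTC are all placeholder assumptions chosen for illustration.

```python
import math

def expected_score(elo):
    """Expected score per game for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) vs H0 (elo0).

    Normal approximation based on the sample mean and variance of the
    per-game score (win = 1, draw = 0.5, loss = 0).
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2
    if var <= 0.0:
        return 0.0  # not enough variety in results to estimate anything
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    """Return 'accept', 'reject', or 'continue' (keep playing games)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept"
    if llr <= lower:
        return "reject"
    return "continue"

# Hypothetical bounds: a more lenient short time control stage and a more
# stringent long time control stage. These numbers are assumptions.
STC = dict(elo0=-1.5, elo1=4.5)
LTC = dict(elo0=0.0, elo1=6.0)

def two_stage(stc_wdl, ltc_wdl):
    """A patch passes only if both stages end in 'accept'."""
    if sprt_decision(*stc_wdl, **STC) != "accept":
        return False
    return sprt_decision(*ltc_wdl, **LTC) == "accept"
```

In practice you would call sprt_decision after every batch of games and stop as soon as it returns something other than "continue"; that early stopping is what saves testing resources compared to a fixed number of games.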
If you look at the graphs on http://www.sp-cc.de/, you can see that it just keeps getting better, one small improvement at a time.
Here is the announcement of fishtest on the talkchess forum: http://talkchess.com/forum/viewtopic.php?t=47885&highlight=s...
Initial discussion of the introduction of SPRT into fishtest, which led to a dramatic increase in our ability to measure improvements in self-play, in a statistically sound manner: https://groups.google.com/forum/?fromgroups=#!searchin/fishc...
SPRT background here: https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...