by basic_stat 4400 days ago
> During the final event, after playing 64 games against Komodo, Stockfish won with the score of 35½-28½. No doubt is further allowed: Stockfish is the best chess player ever!

From a statistical point of view, this isn't actually significant, despite the fact that draws help reduce the variance.

45 of those games are draws, leaving a 13-6 score in favor of Stockfish. Considering a null hypothesis of a binomial distribution with n=19 and equal chance of winning, the two-sided p-value for that score is 0.115. Unless you already have strong evidence that Stockfish is better than Komodo, you shouldn't conclude anything about which one is best.
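For anyone who wants to check the arithmetic, here is a minimal sketch of the exact binomial test on the 19 decisive games, using only the Python standard library. The helper names are made up for illustration. One caveat, which is an inference rather than something stated above: the quoted 0.115 seems to correspond to the mid-p convention (counting half the probability of the observed outcome), while the plain doubled tail is larger.

```python
from math import comb

def binom_pmf(k: int, n: int) -> float:
    """P(X = k) for X ~ Binomial(n, 1/2)."""
    return comb(n, k) / 2 ** n

def upper_tail(k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 1/2)."""
    return sum(binom_pmf(j, n) for j in range(k, n + 1))

n, wins = 19, 13  # decisive games and Stockfish wins from the quoted result

one_sided = upper_tail(wins, n)
# The null distribution is symmetric, so doubling one tail gives the two-sided value.
two_sided = 2 * one_sided
# Mid-p variant: count only half the probability of the observed outcome.
mid_p_two_sided = 2 * (upper_tail(wins + 1, n) + 0.5 * binom_pmf(wins, n))

print(f"one-sided exact: {one_sided:.4f}")
print(f"two-sided exact: {two_sided:.4f}")
print(f"two-sided mid-p: {mid_p_two_sided:.4f}")
```

This gives roughly 0.084, 0.167 and 0.115. Whichever convention you pick, the result stays above the usual 0.05 threshold.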


But it fundamentally _isn't_ binomial across 19 games, because of draws. You can't just drop the draws from the analysis; doing so is a terrible application of statistics.
Once you condition on the number of draws, you do get that binomial distribution.

Suppose you have a coin, which gives a random outcome X. But you can only observe the outcome of X when another independent binary random variable Y is true. How can you tell if X is biased? Since X and Y are independent, the observations where Y is false are irrelevant since they don't tell you anything about X. So you just keep the observations where Y is true, and from there you can apply a binomial statistical test to the observations of X.

[ In case you're wondering whether applying statistical tests to variable sample sizes is valid, the answer is yes. Under the null hypothesis, a p-value is a uniformly distributed random variable: a map from the set of observables (augmented by a continuous random variable, since our set of observables is discrete) to [0,1]. Our p-value is a mixture of p-values computed on smaller sample sizes, each of which is uniform, so it is still uniform. ]

This is exactly what happens here: consider a random outcome {win,lose,draw}. If you don't have a draw, let Y be true and X be the outcome of the game. If you have a draw, let Y be false and X be a random coin with the same distribution as for non-drawn games. Then X and Y are independent random variables and the above applies.
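The conditioning argument can be checked numerically. The sketch below assumes a null model in which each game is independently a draw with some fixed probability and is otherwise won by either side with equal chance. It then verifies, in exact rational arithmetic, that the distribution of wins given 19 decisive games out of 64 is Binomial(19, 1/2) whatever the draw probability is. The function names and the draw rates tried are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def trinomial_pmf(w, l, d, p_win, p_loss, p_draw):
    """Exact P(W = w, L = l, D = d) over w + l + d independent games."""
    n = w + l + d
    coef = factorial(n) // (factorial(w) * factorial(l) * factorial(d))
    return coef * p_win**w * p_loss**l * p_draw**d

def conditional_win_pmf(games, decisive, p_draw):
    """Distribution of W given W + L = decisive, with equal win/loss chances."""
    p_side = (1 - p_draw) / 2
    draws = games - decisive
    joint = [trinomial_pmf(w, decisive - w, draws, p_side, p_side, p_draw)
             for w in range(decisive + 1)]
    total = sum(joint)
    return [p / total for p in joint]

# Reference distribution: Binomial(19, 1/2) as exact fractions.
binomial = [Fraction(comb(19, w), 2**19) for w in range(20)]

# The draw rates below are arbitrary illustrative values.
for p_draw in (Fraction(1, 10), Fraction(1, 2), Fraction(45, 64)):
    same = conditional_win_pmf(64, 19, p_draw) == binomial
    print(f"draw probability {p_draw}: matches Binomial(19, 1/2)? {same}")
```

The draw probability cancels out of the conditional distribution, which is the formal version of "draws tell you nothing about wins versus losses" under this model.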

Informally: draws are not useful information in determining whether there are more wins than losses.

I'm not sure that discounting draws is the right thing to do either. For example, Petrosian was not the strongest attacking player but was very, very tough to beat.

This also calls into question the notion of "strongest chess player." Who is stronger: the flashy attacking player who wins half the time and loses the other half, or the stonewall who poses little threat but whom you can never beat?

The first analysis isn't terrible. It gets the important points right (that the draws are not evidence of difference in skill) and moves on to the remaining evidence (difference in wins to losses).

The draws are at best evidence towards equality (not against it). Allow the number of draws to vary and the probability of seeing a win-loss difference of at least 7 (13 wins to 6 losses) in 64 games, with a draw probability of 45/64, moves up to about 0.13, or 13%, which is even less significant. (This assumes the two players are identical, an appropriate null hypothesis.) So in about one tournament in 8 you would expect this much of a lead, even if it was one algorithm playing itself. From one tournament we can say it is likely that the one algorithm is in fact better, but the result doesn't rise to the standard of statistical significance.

<code>
# R code to empirically estimate the two-sided probability of
# seeing a lead of at least 7 games (13 wins vs. 6 losses) when 64 games are played
# and the assumed probability of a draw is 45/64
# with the null assumption win/loss odds are equal
simulate <- function(nplay,ndraw) {
   sample(c('w','d','l'),size=nplay,replace=TRUE,
      prob=c((nplay-ndraw)/2,ndraw,(nplay-ndraw)/2)/nplay)
}
wldiff <- function(v) { abs(sum(v=='w')-sum(v=='l')) }
set.seed(350920)
stats <- replicate(10000,wldiff(simulate(64,45)))
print(sum(stats>=13-6)/length(stats))
## [1] 0.1341
</code>
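The quantity estimated by the simulation can also be computed exactly, which removes the Monte Carlo noise. The following is a sketch in Python rather than R, under the same null model (draw probability 45/64, equal win/loss odds); the function name is made up for illustration. It sums over the possible numbers of decisive games and, for each, over the win counts that give a lead of at least 7.

```python
from math import comb

def prob_lead_at_least(games: int, p_draw: float, lead: int) -> float:
    """Exact P(|W - L| >= lead) when each game is a draw with probability
    p_draw and is otherwise won by either side with equal probability."""
    total = 0.0
    for decisive in range(games + 1):
        # Probability that exactly `decisive` of the games are not drawn.
        p_decisive = (comb(games, decisive)
                      * (1 - p_draw) ** decisive
                      * p_draw ** (games - decisive))
        # Given the decisive games, wins follow Binomial(decisive, 1/2),
        # and the lead is W - L = 2 * W - decisive.
        tail = sum(comb(decisive, w)
                   for w in range(decisive + 1)
                   if abs(2 * w - decisive) >= lead) / 2 ** decisive
        total += p_decisive * tail
    return total

p = prob_lead_at_least(games=64, p_draw=45 / 64, lead=13 - 6)
print(f"P(|wins - losses| >= 7) = {p:.4f}")
```

The result should land close to the simulated 0.1341, within the sampling error you would expect from 10000 replications.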

(it is weird that somebody, not me, created a throw-away account to make the original comment. likely they are involved in chess development, or know how quickly stat discussions go sideways)

> (it is weird that somebody, not me, created a throw-away account to make the original comment. likely they are involved in chess development, or know how quickly stat discussions go sideways)

I'm not involved in chess, I just don't like long-term accounts.

You get a slightly different p-value because the ordering you chose is slightly different from mine. Compared to mine, it favors matchups where the draw probability is low.

I get you. I fixed the draw probability at its observed value, whereas you fixed the number of draws at its observed value. But really I just meant to point out that you were right (while giving a simulation that didn't involve explicit conditioning).