by eterm 4397 days ago
But it fundamentally _isn't_ binomial across 19 games, because of draws. You can't just ignore draws in the analysis; doing so is a terrible application of statistics.

Once you condition on the number of draws, you do get that binomial distribution.

Suppose you have a coin, which gives a random outcome X. But you can only observe the outcome of X when another independent binary random variable Y is true. How can you tell if X is biased? Since X and Y are independent, the observations where Y is false are irrelevant since they don't tell you anything about X. So you just keep the observations where Y is true, and from there you can apply a binomial statistical test to the observations of X.

[ In case you're wondering whether applying statistical tests to variable sample sizes is valid, the answer is yes: a p-value is a random variable mapping the set of observables (augmented by a continuous random variable, since our set of observables is discrete) to [0,1], and under the null hypothesis it is uniform. Our p-value is a mixture of p-values on smaller sample sizes, each of them uniform, so it is still uniform. ]
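The bracketed aside can be checked by simulation. The sketch below is illustrative only: the 64 trials, the observation probability of 0.3, and the names `one_pvalue` and `pvals` are assumptions, not anything from the thread. No auxiliary continuous randomization is added here, so with discrete data the p-values come out conservative rather than exactly uniform, and the false-positive rate at the 5% level should not exceed 5%.

```r
# Sketch: X is a fair coin, observed only when an independent Y is true.
# Assumed for illustration: 64 trials, P(Y = TRUE) = 0.3.
one_pvalue <- function(ntrial = 64, p_obs = 0.3) {
  n_seen <- rbinom(1, ntrial, p_obs)   # how many outcomes of X we get to see
  if (n_seen == 0) return(1)           # nothing observed, no evidence either way
  heads <- rbinom(1, n_seen, 0.5)      # fair coin, so the null hypothesis is true
  binom.test(heads, n_seen, p = 0.5)$p.value
}

set.seed(1)
pvals <- replicate(5000, one_pvalue())

# Fraction of false positives at the 5% level, despite the varying sample size
print(mean(pvals <= 0.05))
```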

This is exactly what happens here: consider a random outcome {win,lose,draw}. If you don't have a draw, let Y be true and X be the outcome of the game. If you have a draw, let Y be false and X be a random coin with the same distribution as for non-drawn games. Then X and Y are independent random variables and the above applies.

Informally: draws are not useful information in determining whether there are more wins than losses.
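Applied to this match, conditioning leaves only the decisive games. A minimal sketch, assuming the 13 wins and 6 losses implied by the `13-6` in the R snippet further down the thread (the 19 games mentioned at the top), and assuming a two-sided test is the one wanted:

```r
wins <- 13                   # assumed from the `13-6` in the simulation code
losses <- 6
decisive <- wins + losses    # 19 games; the 45 draws are dropped

# Exact two-sided binomial test of "a decisive game is equally likely a win or a loss"
result <- binom.test(wins, decisive, p = 0.5, alternative = "two.sided")
print(result$p.value)
## about 0.167
```

Switching to `alternative = "greater"` gives the one-sided version, which is half of this value because the null distribution is symmetric.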

I'm not sure that discounting draws is the right thing to do either. For example, Petrosian was not the strongest attacking player but was very, very tough to beat.

This also calls into question the notion of "strongest chess player." Who is strongest, the flashy attacking player that wins half the time and loses the other half, or the stonewall that poses little threat but that you can never beat?

The first analysis isn't terrible. It gets the important points right (that the draws are not evidence of difference in skill) and moves on to the remaining evidence (difference in wins to losses).

The draws are at best evidence towards equality (not against it). Allow them to vary, and the likelihood of seeing a difference of 7 between wins and losses (13 to 6) in 64 games with 45 draws moves up to 0.13, or 13%, under the null hypothesis that the two players are identical. That is even less significant. So in about one tournament in 8 you would expect this much of a lead, even if it was one algorithm playing itself. From one tournament we can say it is likely that the one algorithm is in fact better, but the result doesn't rise to the standard of statistical significance.

<code>
# R code to empirically estimate the two-sided probability of
# seeing a lead of at least 7 games (13 wins to 6 losses) when 64 games are played
# and the assumed probability of a draw is 45/64,
# with the null assumption that win/loss odds are equal

# Draw nplay independent results with P(draw) = ndraw/nplay and
# the remaining probability split evenly between win and loss
simulate <- function(nplay,ndraw) {
   sample(c('w','d','l'),size=nplay,replace=TRUE,
      prob=c((nplay-ndraw)/2,ndraw,(nplay-ndraw)/2)/nplay)
}

# Absolute difference between the win and loss counts
wldiff <- function(v) { abs(sum(v=='w')-sum(v=='l')) }

set.seed(350920)
stats <- replicate(10000,wldiff(simulate(64,45)))
print(sum(stats>=13-6)/length(stats))
## [1] 0.1341
</code>

(it is weird that somebody, not me, created a throw-away account to make the original comment. likely they are involved in chess development, or know how quickly stat discussions go sideways)

I'm not involved in chess, I just don't like long-term accounts.

You get a slightly different p-value because the ordering you chose is slightly different from mine. Compared to mine, it favors matchups where the draw probability is low.

I get you. I set the draw probability to its observed value, whereas you set the number of draws to its observed value. But really I just meant to point out that you were right (while giving a simulation that didn't involve explicit conditioning).
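The two conventions can be compared exactly, without simulation. This is a sketch under the same null (wins and losses equally likely), assuming the lead of 7 from the `13-6` in the code above; the names `tail_given_n`, `p_fixed_count`, and `p_fixed_prob` are made up for this illustration.

```r
ngames <- 64
p_decisive <- 19 / 64   # observed share of non-drawn games
lead <- 7               # 13 wins minus 6 losses

# P(|wins - losses| >= lead) given exactly n decisive games
tail_given_n <- function(n) {
  w <- 0:n
  sum(dbinom(w, n, 0.5)[abs(2 * w - n) >= lead])
}

# Number of draws fixed at 45, so there are exactly 19 decisive games
p_fixed_count <- tail_given_n(19)

# Draw probability fixed at 45/64, so the number of decisive games varies
n <- 0:ngames
p_fixed_prob <- sum(dbinom(n, ngames, p_decisive) * sapply(n, tail_given_n))

print(c(fixed_count = p_fixed_count, fixed_prob = p_fixed_prob))
```

The first number is the exact two-sided binomial p-value on 19 games (about 0.167); the second is the quantity the simulation estimates, so it should land near the simulated 0.1341.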