by jmount 4399 days ago
The first analysis isn't terrible. It gets the important points right (that the draws are not evidence of difference in skill) and moves on to the remaining evidence (difference in wins to losses).

The draws are at best evidence towards equality, not against it. Allow their number to vary and the probability of seeing a win-loss difference of at least 7 (13 wins to 6 losses) in 64 games with a 45/64 draw rate moves up to 0.13, or 13%, under the null hypothesis that the two players are identical, which is the appropriate null here. That is even less significant. So in about one tournament in 8 you would expect this much of a lead, even if it were one algorithm playing itself. From one tournament we can say it is likely that the one algorithm is in fact better, but the result doesn't rise to the standard of statistical significance.

<code>
# R code to empirically estimate the two-sided probability of
# seeing a lead of at least 7 games (13 wins vs 6 losses) when 64 games are played
# and the assumed probability of a draw is 45/64,
# with the null assumption that win/loss odds are equal
simulate <- function(nplay,ndraw) {
   sample(c('w','d','l'),size=nplay,replace=TRUE,
      prob=c((nplay-ndraw)/2,ndraw,(nplay-ndraw)/2)/nplay)
}

wldiff <- function(v) { abs(sum(v=='w')-sum(v=='l')) }

set.seed(350920)
stats <- replicate(10000,wldiff(simulate(64,45)))
print(sum(stats>=13-6)/length(stats))
## [1] 0.1341
</code>
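As a cross-check on the simulation, the same null probability can be computed exactly rather than estimated. This is only a sketch under the same assumptions as the code above (each game is independently a draw with probability 45/64, otherwise a win or loss with equal odds); the names `n_games`, `p_decisive`, `lead`, `tail_given_k` and `p_exact` are mine, not from the original analysis.

```r
# Exact two-sided probability of a win-loss lead of at least 7 in 64 games,
# assuming a per-game draw probability of 45/64 and equal win/loss odds.
n_games    <- 64
p_decisive <- (64 - 45) / 64   # probability a single game is not a draw
lead       <- 13 - 6           # observed lead

# For each possible number k of decisive games, W ~ Binomial(k, 1/2)
# and the lead is |W - L| = |2W - k|.
k <- 0:n_games
tail_given_k <- sapply(k, function(kk) {
  w <- 0:kk
  sum(dbinom(w, kk, 0.5)[abs(2 * w - kk) >= lead])
})

# Average the conditional tail probability over the Binomial(64, 19/64)
# distribution of the number of decisive games.
p_exact <- sum(dbinom(k, n_games, p_decisive) * tail_given_k)
print(p_exact)
```

The result should land close to the simulated 0.1341, differing only by Monte Carlo noise from the 10000 replicates.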

(it is weird that somebody, not me, created a throw-away account to make the original comment. likely they are involved in chess development, or know how quickly stat discussions go sideways)


> (it is weird that somebody, not me, created a throw-away account to make the original comment. likely they are involved in chess development, or know how quickly stat discussions go sideways)

I'm not involved in chess, I just don't like long-term accounts.

You get a slightly different p-value because the ordering you chose is slightly different from mine. Compared to mine, it favors matchups where the draw probability is low.

I get you. I set the draw probability to the observed rate, whereas you fixed the number of draws at the observed count. But really I just meant to point out that you were right (while giving a simulation that didn't involve explicit conditioning).
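For comparison, here is a minimal sketch of the conditioned version: fix the number of draws at the observed 45, so only the 19 decisive games are random, each a win or loss with equal odds under the null. This is one reading of the distinction described above, not necessarily the exact calculation in the original comment; the variable names are my own.

```r
# Two-sided probability of a lead of at least 7, conditioning on exactly
# 45 draws, so that W ~ Binomial(19, 1/2) over the decisive games.
n_decisive <- 64 - 45          # 19 decisive games
lead       <- 13 - 6           # observed lead

w <- 0:n_decisive
p_conditional <- sum(dbinom(w, n_decisive, 0.5)[abs(2 * w - n_decisive) >= lead])
print(p_conditional)

# The same number via the built-in exact binomial test
print(binom.test(13, n_decisive, p = 0.5)$p.value)
```

Both lines print roughly 0.167, a little larger than the 0.13 from the unconditioned simulation. Neither is below the usual 0.05 threshold, so the conclusion is the same either way.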