Hacker News
by stevenwu 3311 days ago
This is a question to spark discussion and to fill potential gaps in my knowledge, not a criticism of the article; I very much appreciate the transparency and the build-up from the simple, naive initial approach to the final approach used in production.

Is anyone else bothered by the claim that there "is a 100% chance that the new version is better than the current one", shown by using the bootstrap? Maybe I've just never come across such a use of the bootstrap in my encounters with statistics. I know it as a tool for resampling from your sample to estimate properties of your estimator (mean, variance, what have you) when all you have is a dataset and no clue about the actual distribution.

When I saw the bootstrap paired with that probabilistic claim, I thought the author would calculate a bootstrapped (100-x)% confidence interval for both the current and the new weights, and if the intervals didn't overlap you could claim with (100-x)% certainty that one is better than the other. But the author creates a new statistic that is a function of both datasets: Z_i = 1 if new is better than current on iteration i (on a random subset of the data), else 0, and Z_i = 1 for all N=10000 iterations. The probabilistic claim that new is better than current rests on the fact that no variation was seen in Z_i (I'm also kind of skeptical that, across so many iterations with random subsets, the new weights were better every single time). I think at most you can say that you simulated subsets of the data and new > current 100% of the time; the claim as stated suggests an inference that isn't there.

Maybe I should just ask one of my past stats profs. Open to someone enlightening me.
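
To make the two readings concrete, here is a minimal sketch of both: the share of resamples where new beats current (the mean of Z_i) and a percentile interval for the difference in mean score. Everything in it is an assumption for illustration: the function name `bootstrap_compare`, the synthetic scores, the "higher is better" convention, and resampling with replacement (the article is described as using random subsets, which may differ).

```python
import random

def bootstrap_compare(current, new, n_iter=10_000, alpha=0.05, seed=0):
    """Paired bootstrap on per-example scores, where higher is better.

    Returns (win_rate, (lo, hi)): win_rate is the mean of Z_i, and (lo, hi)
    is a percentile interval for mean(new) - mean(current).
    """
    if len(current) != len(new):
        raise ValueError("expected one score per example for each version")
    rng = random.Random(seed)
    n = len(current)
    diffs = []
    for _ in range(n_iter):
        # Resample example indices with replacement (an assumption; the
        # article is described as using random subsets).
        idx = rng.choices(range(n), k=n)
        diffs.append(sum(new[i] - current[i] for i in idx) / n)
    diffs.sort()
    win_rate = sum(d > 0 for d in diffs) / n_iter  # mean of Z_i
    lo = diffs[int(alpha / 2 * n_iter)]
    hi = diffs[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return win_rate, (lo, hi)

# Synthetic, made-up scores: the new version is better by a small margin.
data_rng = random.Random(1)
current = [data_rng.gauss(0.70, 0.05) for _ in range(100)]
new = [c + data_rng.gauss(0.02, 0.03) for c in current]

win_rate, (lo, hi) = bootstrap_compare(current, new)
print(f"share of resamples with new > current: {win_rate:.4f}")
print(f"95% percentile interval for the mean difference: [{lo:.4f}, {hi:.4f}]")
```

The win rate only says how often resamples of this one dataset agreed with each other. The interval at least shows how large the difference plausibly is, but neither number is a probability that the new version is better on future data.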

1 comments

A quantile-based confidence interval from bootstrapping can yield a 100% confidence interval that does not contain 0, i.e., with 100% of cases positive/negative. But that does not (necessarily) mean that there is a 100% chance that the new version is better than the old one. Confidence intervals are not Bayesian credible intervals and cannot be treated as such. (That said, making certain assumptions about the underlying model can sometimes allow one to treat nonparametric bootstraps in such a way.)

Right. The author finds new > current 100% of the time on his current dataset but makes a statement that implies certainty, or an inference, about future cases. It's like taking 100 men and 100 women, finding that in 100 randomly matched pairs the man was taller than the woman 100 times, and claiming that there is a 100% chance that men are taller than women.

The more I type, the more I realize how pedantic this is, but in stats we're taught to pay extra attention to the conclusions we draw from the data we analyze.
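
A quick way to see why 100 out of 100 does not pin the chance at 100% in the height example: if the true probability that a randomly matched man is taller is p, and the pairs are independent, the chance of observing 100 of 100 is p^100. The sketch below just evaluates that; the function names and the candidate values of p are made up for illustration.

```python
def prob_all_wins(p, n=100):
    """Chance that all n independent pairs show the man taller, given true rate p."""
    return p ** n

def lower_bound_on_p(n=100, confidence=0.95):
    """Smallest p for which n wins out of n still has probability >= 1 - confidence."""
    return (1 - confidence) ** (1 / n)

# Hypothetical true rates, all below 1.
for p in (0.999, 0.99, 0.97):
    print(f"true p = {p}: P(100 of 100) = {prob_all_wins(p):.3f}")

print(f"95% lower bound on p after 100 of 100: {lower_bound_on_p():.4f}")
```

So a clean sweep in 100 pairs is quite compatible with a true rate of 99%, and only rules out rates below roughly 97% at the 95% level. This assumes independent pairs, which bootstrap iterations over the same dataset are not, so the 10000 iterations in the article carry even less information than 10000 independent trials would.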