by bluGill 3304 days ago
> It allows you to ask questions like "What is the likelihood that a curveball is thrown for a strike on a 3-0 count in the 8th inning?" and get statistically significant results.

Does it, though? Medicine is learning that when you do data mining, the bar for statistical significance needs to be much higher. When you follow the traditional hypothesis/experiment loop, a result that clears the 95% confidence level is good enough. However, when you data mine millions of data points, anything interesting is much more likely to be coincidence. Modern statistics is trying to figure out how to handle this.

Data mining is good for generating hypotheses that you can then test with a controlled experiment. It can also be used in meta-analysis to find small effects that were not significant in individual experiments but are significant when you combine them. However, it is dangerous to use it alone.

In more depth: if you test 100 things and find 5 that meet the 95% bar, odds are that all of them are false positives, because at that bar you expect about 5 out of 100 to pass by chance alone. (This is a gross simplification of things like p-values, and real statisticians will cry about it, but the average person has a chance of understanding it.)
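To make the arithmetic concrete, here is a rough simulation sketch. It assumes 100 independent tests in which no real effect exists at all, so each p-value is uniform on [0, 1]; the function names, trial count, and seed are made up for illustration.

```python
import random

def count_false_positives(n_tests, alpha, rng):
    # Assumption: every null hypothesis is true, so each p-value is
    # uniform on [0, 1] and passes the bar with probability alpha.
    return sum(rng.random() < alpha for _ in range(n_tests))

def simulate(n_trials=10_000, n_tests=100, alpha=0.05, seed=0):
    rng = random.Random(seed)
    counts = [count_false_positives(n_tests, alpha, rng) for _ in range(n_trials)]
    mean_hits = sum(counts) / n_trials                 # average "discoveries" per study
    frac_any = sum(c > 0 for c in counts) / n_trials   # studies with at least one
    return mean_hits, frac_any

mean_hits, frac_any = simulate()

# Closed-form values for comparison.
expected_hits = 100 * 0.05           # 5 false positives on average
prob_any = 1 - (1 - 0.05) ** 100     # about 0.994

print(f"simulated: {mean_hits:.2f} hits per study, {frac_any:.3f} with at least one")
print(f"expected:  {expected_hits:.2f} hits per study, {prob_any:.3f} with at least one")
```

So with nothing real in the data, a study of 100 things still turns up about 5 "significant" findings, and almost every such study turns up at least one.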

2 comments

My impression, from experience in bioinformatics, is that baseball likely does not share these issues.

Biological systems are highly variable and absurdly complex, and most datasets come with a host of confounding factors. In comparison, baseball is extremely uniform, and the variability that does exist can often be quantified effectively, and more importantly, often is. Biologists can dream of such high-quality and thorough data, but that's a long way off. This means that the analysis used is quite different.

Genomic data depends greatly on the material used and the methods used. For instance, even with consistent genotypes and identical library preparation, if you collected your RNA a few hours later in the day, you now have a host of circadian changes to contend with that confound your analysis. No one can effectively keep track of all the confounding factors. This means that most analysis needs to be done with direct controls, biological replicates, etc.

In terms of actual analysis, I think the problem is somewhat overstated in your assessment. There are good statistical methods to adjust for multiple comparisons, and the field has largely caught up to the biggest issues. Your assessment was perhaps more accurate 5 years ago, when the problem was mostly the result of poor statistical literacy.
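The comment does not say which adjustment methods it has in mind. As an illustration only, here is a sketch of two well-known ones, the Bonferroni correction and the Benjamini-Hochberg procedure, applied to made-up p-values. The function names are hypothetical, and in practice you would reach for a statistics library rather than hand-rolling this.

```python
def bonferroni(p_values, alpha=0.05):
    # Controls the chance of even one false positive by testing
    # each p-value against alpha / m.
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    # Controls the expected share of false positives among the rejections.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject the hypotheses with the max_k smallest p-values.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject

# Made-up p-values, purely for illustration.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

uncorrected = [p <= 0.05 for p in p_values]
print("uncorrected:       ", sum(uncorrected))                   # 4
print("Bonferroni:        ", sum(bonferroni(p_values)))          # 2
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)))  # 3
```

On these numbers the uncorrected bar passes 4 results, Bonferroni keeps 2, and Benjamini-Hochberg keeps 3. The result at 0.041 that squeaks under 0.05 survives neither correction.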

It's not just this season, though. For many types of baseball statistics, full records are available going back decades, and for some statistics there's literally a century's worth of information. That allows us to ask questions about trends over time and be pretty confident that the answers reflect real effects and not just misinterpreted noise.

People tend to underestimate just how much information there is on baseball, and how well-kept it has been.