by phamilton 3302 days ago
For those not too familiar with professional baseball, there are 162 games in a season. A player starting every game would likely see 600+ at bats. A starting pitcher pitches every 5-6 days and will face 600+ batters per season. League wide there are about 125k outs recorded per season on 700k pitches.

That's a ton of data. It allows you to ask questions like "What is the likelihood that a curveball is thrown for a strike on a 3-0 count in the 8th inning?" and get statistically significant results.
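As a rough sketch of what answering that kind of question could look like, the snippet below filters a list of pitch records and reports the observed strike rate with a Wilson score interval. The record layout (`type`, `balls`, `strikes`, `inning`, `is_strike`) and the toy rows are made up for illustration; they are not the schema of any real feed.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        raise ValueError("no matching pitches")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

def strike_rate(pitches, pitch_type, balls, strikes, inning):
    """Count strikes among pitches matching the given situation."""
    matching = [
        p for p in pitches
        if p["type"] == pitch_type
        and p["balls"] == balls
        and p["strikes"] == strikes
        and p["inning"] == inning
    ]
    n = len(matching)
    k = sum(1 for p in matching if p["is_strike"])
    return k, n, wilson_interval(k, n)

# Hypothetical toy records, only to show the call shape.
toy_pitches = [
    {"type": "CU", "balls": 3, "strikes": 0, "inning": 8, "is_strike": True},
    {"type": "CU", "balls": 3, "strikes": 0, "inning": 8, "is_strike": False},
    {"type": "CU", "balls": 3, "strikes": 0, "inning": 8, "is_strike": True},
    {"type": "FF", "balls": 3, "strikes": 0, "inning": 8, "is_strike": True},
]

k, n, (low, high) = strike_rate(toy_pitches, "CU", 3, 0, 8)
print(f"{k}/{n} strikes, 95% interval {low:.2f} to {high:.2f}")
```

With only a handful of matching pitches the interval is very wide; the point of having several seasons of data is that even narrow situations like this one accumulate enough pitches for the interval to tighten.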


For those wanting to play with the data, there are a lot of resources [0]. I have personally combined older Retrosheet data [1] with modern MLB data for some neat projects, not least to try out tech like Druid (big data, live slicing, etc.). For example, if you wanted data from Sunday's Houston vs. Texas game, GDX has tons of XML to parse at [2]. There are plenty of guides that tell you what's what, of course. I've been meaning to train a TensorFlow graph on existing data to help me win some FanDuel/DraftKings money, but I haven't yet (and I should note that the MLB data has restrictions against bulk or commercial use).

0 - https://github.com/baseballhackday/data-and-resources/wiki/R...

1 - http://retrosheet.org/

2 - http://gdx.mlb.com/components/game/mlb/year_2017/month_06/da...

Thank you so much for these links. Recently I've thought about building something data viz and stats related using baseball stats.

I know MLBAM has a bunch of data they keep to themselves, but I should definitely be able to find something to play with here. Many thanks for sharing!

Wow, thanks for the resources!
> It allows you to ask questions like "What is the likelihood that a curveball is thrown for a strike on a 3-0 count in the 8th inning?" and get statistically significant results.

Does it, though? Medicine is learning that when you do data mining, the bar for statistical significance needs to be much higher. In the traditional hypothesis/experiment loop, a result that clears the 95% bar is considered good enough. However, when you mine millions of data points, anything interesting is much more likely to be coincidence. Modern statistics is still trying to figure out how to handle this.

Data mining is good for generating hypotheses that you can then test with a controlled experiment. It can also be used in meta-analysis to find small effects that were not significant in individual experiments but are significant when you combine them. However, it is dangerous to use it alone.

In more depth: if you test 100 things and find 5 that meet the 95% bar, odds are that all of them are false positives, because a 95% bar lets through about 5 in 100 by chance alone even when nothing real is going on. (This is a gross simplification of things like p-values, and real statisticians will cry about it, but the average person has a chance of understanding it.)
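A small simulation makes the "5 out of 100" point concrete. This is only a sketch under stated assumptions: each "question" compares two groups drawn from the same underlying rate, so any significant difference is a false positive by construction, and the test used (a two-proportion z-test) is my choice for illustration, not something the comment specifies.

```python
import math
import random

def two_proportion_p_value(k1, n1, k2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def count_false_positives(num_questions=100, n=1000, true_rate=0.3,
                          alpha=0.05, seed=0):
    """Ask many questions where no real effect exists; count 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_questions):
        # Both groups share the same true rate, so there is nothing to find.
        k1 = sum(rng.random() < true_rate for _ in range(n))
        k2 = sum(rng.random() < true_rate for _ in range(n))
        if two_proportion_p_value(k1, n, k2, n) < alpha:
            hits += 1
    return hits

print(count_false_positives())
```

Across different seeds the count hovers around 5 per 100 questions, even though none of the "findings" are real.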

My impression, from experience in bioinformatics, is that baseball data likely doesn't share these issues.

Biological systems are highly variable and absurdly complex, and most datasets come with a host of confounding factors. In comparison, baseball is extremely uniform, and the variability that does exist can often be quantified effectively, and more importantly, often is. Biologists can dream of such high-quality and thorough data, but that's a long way off. This means that the analysis used is quite different.

Genomic data depends greatly on the material used and the methods used. For instance, even with consistent genotypes and identical library preparation, if you collected your RNA a few hours later in the day, you now have a host of circadian changes to contend with that confound your analysis. No one can effectively keep track of all the confounding factors. This means that most analysis needs to be done with direct controls, biological replicates, etc.

In terms of actual analysis, I think the problem is somewhat overstated in your assessment. There are good statistical methods to adjust for multiple comparisons, and the field has largely caught up on the biggest issues. Your assessment was perhaps more accurate 5 years ago, when the problem was mostly the result of poor statistical literacy.
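The comment doesn't say which adjustments it has in mind; two widely used ones are the Bonferroni correction and the Benjamini-Hochberg procedure. A minimal sketch of both, with made-up p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Illustrative p-values only, not from any real analysis.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(p < 0.05 for p in p_values), "pass the unadjusted 0.05 bar")
print(sum(bonferroni(p_values)), "survive Bonferroni")
print(sum(benjamini_hochberg(p_values)), "survive Benjamini-Hochberg")
```

Bonferroni is the stricter of the two; Benjamini-Hochberg gives up some of that strictness in exchange for more power when many questions are asked at once.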

It's not just this season, though. For many types of baseball statistics, full records are available going back decades, and for some statistics there's literally a century's worth of information. That allows us to ask questions about trends over time and be pretty confident that the answers are real, and not just misinterpreted noise.

People tend to underestimate just how much information there is on baseball, and how well-kept it has been.

That's 162 games per team - it's 2,430 games for all of MLB. Compare that with NFL's 16 games per team, or 256 games league wide.

(Edited to fix NFL games per team per season)
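The league-wide totals follow from the per-team numbers once you remember that each game involves two teams. A quick check, assuming 30 MLB teams and 32 NFL teams (team counts are not stated in the comment):

```python
def league_games(teams, games_per_team):
    """Total regular-season games; each game is shared by two teams."""
    return teams * games_per_team // 2

# Team counts are assumptions added here, not figures from the comment.
mlb_total = league_games(teams=30, games_per_team=162)
nfl_total = league_games(teams=32, games_per_team=16)
print(mlb_total, nfl_total)
```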

Er, that's 16 games per team for the NFL.

Your point is still valid, though.

Or 19 if you're the Pats