by jbclements 3130 days ago
Okay, this is going to sound mean, but this is like the definition of p-hacking. When you look at 30 values, you simply can't be surprised that one of them is lower than the mean, with a p value around 1/20th. Use something like a Bonferroni correction, to get a significance level of 1/600. Does the result still stand up? In fact, there's an xkcd about this very topic. https://www.xkcd.com/882/
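For reference, the arithmetic behind that 1/600 figure can be sketched as below. This is a minimal illustration that assumes 30 independent comparisons and a conventional per-test level of 0.05; the variable names are made up for the example and nothing here comes from the original post's data.

```python
# Bonferroni arithmetic, assuming 30 independent tests at the usual 0.05 level.
alpha = 0.05      # conventional per-test significance level (1/20)
n_tests = 30      # number of values compared, as stated in the comment

# Bonferroni: divide the significance level by the number of comparisons.
bonferroni_alpha = alpha / n_tests

# Chance that at least one of n_tests independent null tests falls below alpha.
familywise_error = 1 - (1 - alpha) ** n_tests

print(f"corrected threshold: {bonferroni_alpha:.5f} "
      f"(about 1/{round(1 / bonferroni_alpha)})")
print(f"P(at least one false positive): {familywise_error:.3f}")
```

Under those assumptions, an uncorrected look at 30 values turns up at least one "significant" result roughly 79% of the time, which is why the corrected threshold is so much stricter.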
2 comments

I completely agree that it's important to take this kind of thing into account when approaching a problem like this. As I say in the post, "There are 31 days and one of them has to be smallest. Maybe the 11th isn’t an outlier; it’s just on the smaller end and our eyes are picking up on a pattern that doesn’t exist."

I'll admit that a straight p-value is not the appropriate statistic here. I don't even know what the perfect statistic for this problem is. A Bonferroni correction is not enough because not only is the 11th of the month the lowest for a particular year--it's the lowest for every year.

I was convinced that this was real when I looked at the first line graph of the post. The 11th is the lowest either every year or almost every year, being 3-5 standard deviations below the mean for the bulk of the last 200 years. That just can't happen by chance no matter how you slice it.
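A rough way to put a number on "lowest every year" is sketched below. It is a back-of-the-envelope null model, not a proper test: it assumes each of the 31 days is equally likely to be the lowest in any given year and that years are independent of each other, which real counts almost certainly violate. The function names and the 10-year example are hypothetical.

```python
from fractions import Fraction


def p_specific_day_lowest(n_days: int, n_years: int) -> Fraction:
    """P(one pre-chosen day is the lowest in every year) under the null model."""
    return Fraction(1, n_days) ** n_years


def p_same_day_lowest(n_days: int, n_years: int) -> Fraction:
    """P(the same day, whichever it is, is the lowest in every year)."""
    if n_years < 1:
        raise ValueError("n_years must be at least 1")
    # The first year can pick any day; each later year must match it.
    return Fraction(1, n_days) ** (n_years - 1)


# Hypothetical example: the same day coming out lowest 10 years running.
print(float(p_same_day_lowest(31, 10)))
```

The second function is the "one of them has to be smallest" version: it does not single out the 11th in advance, so it already accounts for having 31 days to choose from.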

If anyone knows the proper way to calculate a statistic on something like this, I would love to hear about it.

The significance is absolutely robust. In the early 20th century the measured counts of "<month> 11th" are many standard deviations off.