by jgoodwin 4582 days ago
A bit off topic, but most of statistics also breaks.

If you go back to _Mathematical Statistics_ by R. A. Fisher, early in the last century, and look at his arguments about binning 'big data' into histograms, he has a nice little construction that uses the notion of an 'angle' running through the data set, does a Fourier series expansion, keeps the 'DC' term from the cosine series, and waves his hand about second-order effects. He does estimate them for the sine-like series, and finds that for a data set of size N = 1 trillion they might be a 10% effect.

The only remnant of this whole proceeding in modern lore (and even Ph.D. statisticians may not have heard of it) is Sheppard's correction for equal class-interval histograms:

http://mathworld.wolfram.com/SheppardsCorrection.html
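
To make the correction concrete, here is a minimal sketch, assuming normally distributed data grouped into equal-width bins of width h and represented by bin midpoints. The helper names (`pvar`, `bin_to_midpoints`, `sheppard_corrected_variance`) and the sample size are made up for illustration; the only thing taken from the page above is the second-moment correction, variance minus h^2/12.

```python
import math
import random


def pvar(xs):
    """Population variance (divide by N), computed in two passes."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def bin_to_midpoints(values, width):
    """Replace each value by the midpoint of its equal-width bin."""
    return [(math.floor(v / width) + 0.5) * width for v in values]


def sheppard_corrected_variance(binned, width):
    """Variance of the binned data minus Sheppard's term h^2 / 12."""
    return pvar(binned) - width ** 2 / 12


# Assumed toy setup: 200,000 standard-normal draws, bin width h = 1.
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
width = 1.0
binned = bin_to_midpoints(raw, width)

var_raw = pvar(raw)                # variance before binning
var_binned = pvar(binned)          # inflated by roughly h^2 / 12
var_corrected = sheppard_corrected_variance(binned, width)

print(var_raw, var_binned, var_corrected)
```

In this toy setup the binned variance overshoots the raw variance by about h^2/12, and subtracting that term brings it most of the way back. It says nothing about the higher-order terms Fisher waved away, which is the part that matters at very large N.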

But of course when your datasets are routinely 1 billion rows, a 10% effect a mere 3 orders of magnitude away in dataset size should start to make you nervous.

Moral: once you get a billion or so data points of anything, it's time to redo the Maths, very very carefully.

1 comment

I am a PhD statistician and I can't understand your comment.

What statistic is being calculated for the N=1 Trillion dataset? And which way of calculating it would be off by 10%?