by jgoodwin
4582 days ago
A bit off topic, but most of statistics also breaks. If you go back to _Mathematical Statistics_ by R. A. Fisher, from early in the last century, and look at his argument about binning 'big data' into histograms, he has a nice little construction that uses the notion of an 'angle' running through the data set, does a Fourier series expansion, keeps the 'DC' term of the cosine series, and waves his hand about second-order effects. He does estimate them for the sine-like series, and finds that for a data set of size N = 1 trillion they might amount to a 10% effect.

The only remnant of this whole procedure in modern lore (and even Ph.D. statisticians may not have heard of it) is Sheppard's correction for histograms with equal class intervals: http://mathworld.wolfram.com/SheppardsCorrection.html

But of course, when your datasets routinely run to 1 billion rows, a 10% effect only three orders of magnitude further up in dataset size should start to make you nervous. Moral: once you have a billion or so data points of anything, it's time to redo the maths, very, very carefully.
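For anyone who hasn't met Sheppard's correction, here is a rough sketch of what it does to a variance computed from binned data: with equal bins of width h, the variance of the bin midpoints overshoots by about h^2/12, and the correction subtracts that. The function names, the Gaussian test data, the bin width, and the sample size are all my own choices for illustration. None of it comes from Fisher's text, and it does not try to reproduce his N = 1 trillion estimate.

```python
import math
import random


def variance(xs):
    """Population variance (divide by N)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def binned_variance(values, bin_width):
    """Return (grouped, corrected) variance for equal-width bins.

    grouped   : variance after replacing each value by its bin midpoint
    corrected : grouped variance minus Sheppard's term, bin_width**2 / 12
    """
    # Bins are [k*h, (k+1)*h); each value is moved to its bin's midpoint.
    midpoints = [(math.floor(v / bin_width) + 0.5) * bin_width for v in values]
    grouped = variance(midpoints)
    corrected = grouped - bin_width ** 2 / 12
    return grouped, corrected


# Illustrative data only: standard normal draws with a fixed seed.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

raw = variance(sample)
grouped, corrected = binned_variance(sample, bin_width=1.0)

print(f"raw variance       : {raw:.4f}")
print(f"grouped variance   : {grouped:.4f}")
print(f"Sheppard-corrected : {corrected:.4f}")
```

On smooth, roughly bell-shaped data like this, the corrected figure should land much closer to the raw variance than the grouped one does. What is left over after the correction is the second-order part that the hand-waving covers.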
What statistic is being calculated for the N = 1 trillion dataset? And which way of calculating it would be off by 10%?