by mrow84 4029 days ago
"Fitting a distribution to data" is pretty common parlance in my experience, and there is even a Wikipedia article with a relevant name [0].

I presume that the parallelisation point was with reference to the point made by the article, that the calculation of means and variances can be parallelised, so large datasets can be dealt with efficiently.
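For what it's worth, here is a minimal sketch of how that can work. It is not taken from the article: it assumes the data is split into chunks, summarises each chunk in one pass (Welford's update), and combines the summaries with the usual pairwise merge formula. The names `Moments`, `summarize`, `merge` and `sample_variance` are made up for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Moments:
    n: int = 0          # number of observations
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean


def summarize(chunk):
    """One-pass (Welford) summary of a single chunk; could run on any worker."""
    s = Moments()
    for x in chunk:
        s.n += 1
        delta = x - s.mean
        s.mean += delta / s.n
        s.m2 += delta * (x - s.mean)
    return s


def merge(a, b):
    """Combine two chunk summaries without revisiting the raw data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def sample_variance(s):
    """Unbiased sample variance (divides by n - 1); needs at least two points."""
    return s.m2 / (s.n - 1)


if __name__ == "__main__":
    chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0]]
    # The per-chunk summaries are independent, so this map could be parallel.
    total = reduce(merge, map(summarize, chunks), Moments())
    print(total.mean, sample_variance(total))
```

Since `merge` is associative, the summaries can be combined in any order, which is what makes the computation easy to distribute.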

Is there something else you are missing?

[0] http://en.wikipedia.org/wiki/Distribution_fitting


Okay, from the Wikipedia article, distribution fitting appears to be what I feared it might be.

I'd never do anything like that and would advise others not to either.

Why? Because it is not the least bit clear just what the heck you get.

Next, likely you should not fit at all. Instead, if you want to use some distribution with parameters, e.g., Gaussian, uniform, exponential, then just estimate the parameters and not the distribution.

E.g., if you know that the data is independent, identically distributed Gaussian, then take the sample mean and sample variance and let those be the two parameters in the Gaussian distribution. In that case, you will know that the expectation and variance of the distribution are the same as in your data, and that's good.
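A minimal sketch of that, assuming the data really is i.i.d. Gaussian and using only Python's standard `statistics` module. The function name `estimate_gaussian` is made up for illustration.

```python
import random
import statistics


def estimate_gaussian(data):
    """Return (mean, variance) estimates for data assumed to be i.i.d. Gaussian."""
    mu = statistics.fmean(data)
    # statistics.variance is the unbiased sample variance (divides by n - 1).
    var = statistics.variance(data, mu)
    return mu, var


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data; the true parameters (10, 2**2) are chosen arbitrarily.
    data = [rng.gauss(10.0, 2.0) for _ in range(500)]
    mu, var = estimate_gaussian(data)
    print(f"mean ~ {mu:.3f}, variance ~ {var:.3f}")
```

The two numbers returned are the whole "fit": plug them in as the parameters of the Gaussian.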

That sample mean and variance are sufficient statistics for the Gaussian is also a biggie.

And look into the situation for the rest of the exponential family.

See also the famous

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.

If you want to find the variance of a large data set, then how much accuracy do you want? Generally, the sample variance from a few hundred numbers will be okay, and then you don't need to consider execution on parallel computer hardware.

R. Hamming once wrote, "The purpose of computing is insight, not numbers."

Along that line, finding sample mean and variance of a huge data set promises little or no more "insight" than just sample mean and variance of an appropriately large sample. Of course, we are assuming that the data is independent and identically distributed so that a good sample is easy to find.
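A rough sketch of the trade-off, again assuming i.i.d. Gaussian data. For such data the variance of the sample variance is 2σ⁴/(n − 1), so its relative standard error is sqrt(2/(n − 1)), which is about 7% at n = 400. The helper names and the synthetic data set are made up for illustration.

```python
import math
import random
import statistics


def relative_se_of_variance(n):
    """Relative standard error of the sample variance for i.i.d. Gaussian data."""
    return math.sqrt(2.0 / (n - 1))


def subsample_variance(data, k, rng):
    """Sample variance of k points drawn without replacement from data."""
    return statistics.variance(rng.sample(data, k))


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for a "huge" data set; size and parameters are arbitrary.
    data = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
    full = statistics.variance(data)
    sub = subsample_variance(data, 400, rng)
    print(f"full: {full:.4f}  subsample of 400: {sub:.4f}")
    print(f"expected relative error at n=400: {relative_se_of_variance(400):.3f}")
```

Whether an error of that size is acceptable is exactly the "how much accuracy do you want?" question; quadrupling the sample size roughly halves it.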

I don't know what you inferred from the wiki article, but of course "to fit a gaussian" is to find the parameters describing it, in this case mean and variance.

Look at the "Techniques" section; the first three are:

"Parametric methods, by which the parameters of the distribution are calculated from the data series.[2] The parametric methods are: method of moments; method of L-moments[3]; Maximum likelihood method[4]"

which, if you work them out for a Gaussian, give exactly what you'd expect: the sample mean and the sample variance.
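As a quick numerical check for the Gaussian case, here is a sketch with made-up helper names. The maximum likelihood estimates are the sample mean and the variance with divisor n (the method of moments gives the same pair); the divisor n − 1 version is the unbiased variant. The code only verifies that nudging either parameter away from those values lowers the log-likelihood.

```python
import math
import statistics


def gaussian_log_likelihood(data, mu, var):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and variance var."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * var) - ss / (2.0 * var)


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates: mean and variance (divisor n)."""
    mu = statistics.fmean(data)
    var = statistics.pvariance(data, mu)
    return mu, var


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    mu, var = gaussian_mle(data)
    best = gaussian_log_likelihood(data, mu, var)
    # Perturbing either parameter should never improve the likelihood.
    for m, v in [(mu + 0.1, var), (mu - 0.1, var), (mu, var * 1.1), (mu, var * 0.9)]:
        print(gaussian_log_likelihood(data, m, v) < best)
```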