by christopheraden 4819 days ago
Interesting paradox. I haven't seen that many statisticians using just a histogram when determining whether a certain distribution fits data reasonably. Kernel Density Estimators are a much better choice (for continuous data, like the data in the post), but they are also affected by your choice of bandwidth. When it comes down to it, like going to the doctor, sometimes the best choice is to get a second (or third!) opinion. For what it's worth, drawing a QQ Plot (something I've seen in every statistical consultation I've ever done) reveals the dependent structure of the data immediately and obviously in the form of a perfect linear relationship between any two variables.
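To make the Q-Q point concrete, here is a minimal sketch of a two-sample Q-Q comparison in plain Python. The numbers are made up for illustration (they are not the data from the post), and `qq_points` and `pearson` are hypothetical helper names. It assumes two equal-sized samples where one is just a shifted copy of the other, which is the kind of dependence a histogram with unlucky bins can hide.

```python
def qq_points(a, b):
    """Pair the sorted values of two equal-sized samples (a two-sample Q-Q plot)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return list(zip(sorted(a), sorted(b)))


def pearson(points):
    """Pearson correlation of the (x, y) pairs, used here as a straightness measure."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: sample_b is sample_a shifted by 0.25 and listed in another order.
sample_a = [1.9, 0.4, 3.2, 2.5, 1.1, 2.8]
sample_b = [x + 0.25 for x in reversed(sample_a)]

points = qq_points(sample_a, sample_b)
r = pearson(points)  # very close to 1 when the points fall on a straight line
```

Plotting `points` as a scatter plot would show them on a line with slope 1 and intercept 0.25, whatever bin width a histogram of either sample happened to use.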

I think it's foolish to assume you have a magic tool that will instantly give you a meaningful probability distribution that can statistically reproduce arbitrary datasets. Once you've chosen a certain bandwidth (by fixed binning, or by choosing a kernel), you've lost the ability to resolve structure finer than that, and you cannot quantify parameters (e.g. the macroscopic view) on scales much larger than it.
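A rough sketch of the resolution point, assuming a hand-rolled Gaussian kernel and made-up data with two tight clusters. `gaussian_kde` and `count_modes` are hypothetical helpers written for this example, not a library API. Counting local maxima on a grid is a crude stand-in for "how much structure can I see at this bandwidth".

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function built from Gaussian kernels of the given bandwidth."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def count_modes(data, bandwidth, lo, hi, step=0.01):
    """Count local maxima of the KDE on an evenly spaced grid over [lo, hi]."""
    density = gaussian_kde(data, bandwidth)
    n_steps = int(round((hi - lo) / step))
    ys = [density(lo + i * step) for i in range(n_steps + 1)]
    # A grid point is a mode if it rises from the left and does not rise to the right.
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] >= ys[i + 1])


# Hypothetical data: two clusters of four points each.
data = [1.0, 1.1, 1.2, 1.3, 3.0, 3.1, 3.2, 3.3]
modes = {h: count_modes(data, h, lo=0.0, hi=4.5) for h in (0.02, 0.3, 2.0)}
```

With these points the tiny bandwidth sees every observation as its own bump (8 modes), the middle one sees the two clusters (2 modes), and the wide one smears everything into a single hump (1 mode). None of the three is "the" distribution; each one answers a question at a different scale.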

But of course, playing around with these parameters will hopefully give you a nice plot, insight into the problem and allow you to propose a proper model describing your data. Then you can fit this model to your data and extract the model parameters more precisely.
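As a sketch of that last step, suppose the exploratory plots suggested a single normal distribution (an assumption made purely for illustration, with invented data). Python's `statistics.NormalDist.from_samples` estimates the mean and sample standard deviation directly, with no bandwidth or bin width involved; `log_likelihood` is a hypothetical helper for comparing candidate models.

```python
import math
from statistics import NormalDist

# Hypothetical data; in practice this is whatever you were plotting.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Fit the proposed model: a normal with mean and sample stdev taken from the data.
fitted = NormalDist.from_samples(data)


def log_likelihood(model, xs):
    """Sum of log densities of xs under the given NormalDist."""
    return sum(math.log(model.pdf(x)) for x in xs)


ll_fitted = log_likelihood(fitted, data)
ll_standard = log_likelihood(NormalDist(0.0, 1.0), data)  # a deliberately poor model
```

The fitted parameters (`fitted.mean`, `fitted.stdev`) are the quantities you would report, and the log-likelihood gives one way to compare this model against alternatives that the plots might also have suggested.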

And when the width of the topological features in the distribution matches your kernel size, of course, this PDF will look almost identical to the density plots.

Indeed, although Q-Q plots are very unlikely to be understood by people who don't have a good grasp of statistics, whereas a misleading histogram will be (and probably without knowledge of the caveats behind histograms).
A great point, but therein lies my biggest complaint with the simplification of statistics that I see in the startup world--sometimes the technical details are actually important. As an analogy, while mass-production has given us a car that anyone can operate, we are largely helpless when one breaks down. Complications abound when individuals try to leverage an overly-simplistic view of a subject (raise your hand if you've heard "We are 95% sure the true [...] lies in this confidence interval").

To the credit of the shadier individuals in my profession, this histogram subtlety nicely highlights how it can be quite easy to bend the data to your argument using ad-hoc procedures (KDEs, hists, QQs, boxplots). A carefully chosen bin width, smoothing parameter, or covariate can present a different view of the data than some other parameter/covariate. That's why it's nice to have other statisticians capable of reproducing and disseminating the work.
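A small sketch of how little it takes, using invented numbers and a hypothetical `histogram` helper: the same six observations, the same bin width, and only the bin origin moved by half a bin.

```python
import math


def histogram(data, width, origin=0.0):
    """Count observations per bin, where bins are [origin + k*width, origin + (k+1)*width)."""
    idx = [math.floor((x - origin) / width) for x in data]
    first = min(idx)
    counts = [0] * (max(idx) - first + 1)
    for i in idx:
        counts[i - first] += 1
    return counts


# Hypothetical data, chosen so that no value sits exactly on a bin edge.
data = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]

peaked = histogram(data, width=1.0, origin=0.0)  # bins start at 0, 1, 2, 3
flat = histogram(data, width=1.0, origin=0.5)    # bins start at 0.5, 1.5, 2.5
```

Here `peaked` comes out as `[1, 2, 2, 1]` and `flat` as `[2, 2, 2]`, so one picture suggests a central bump and the other a uniform spread, from identical data. Reporting the bin width and origin alongside the plot is what lets someone else reproduce or challenge the picture.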