by xfs 4819 days ago
But then you would have to choose a certain kernel and assume the data conforms to that distribution, which isn't always true.
2 comments

This is really not true.

A histogram is considered (by statisticians) to be a non-parametric density estimator. Kernel density estimation is also considered a non-parametric density estimator.

The kernel function you use does not depend on the distribution of your data. If you have normal data, you can use an equation to provide the 'optimal' bandwidth in that case, but this is about bandwidth selection and not the kernel itself.
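To make the kernel/bandwidth distinction concrete, here is a rough sketch in plain Python (not anyone's production code). The function names and the bimodal test data are made up for illustration. The bandwidth uses the usual normal-reference formula for a Gaussian kernel, h = (4 / (3n))^(1/5) * sigma, which is about 1.06 * sigma * n^(-1/5). Both kernels give a valid density on clearly non-normal data; the normal-reference bandwidth is simply tuned for normal data and will tend to oversmooth here, which is a bandwidth issue and not a kernel issue.

```python
import math
import random
import statistics

def gaussian_kernel(u):
    # Standard normal density used purely as a smoothing weight.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(u):
    # Compactly supported alternative kernel; also integrates to 1.
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def normal_reference_bandwidth(data):
    # Assumption: Gaussian kernel and (roughly) normal data.
    # h = (4 / (3n))^(1/5) * sigma, i.e. about 1.06 * sigma * n^(-1/5).
    n = len(data)
    return (4.0 / (3.0 * n)) ** 0.2 * statistics.stdev(data)

def kde(x, data, h, kernel):
    # Kernel density estimate at a single point x.
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

def integrate(f, lo, hi, step):
    # Midpoint-rule integral, only used to check the estimate is a density.
    n = int(round((hi - lo) / step))
    return sum(f(lo + (i + 0.5) * step) for i in range(n)) * step

# Hypothetical bimodal sample: nothing like a single normal distribution.
random.seed(0)
data = ([random.gauss(-2.0, 0.5) for _ in range(200)]
        + [random.gauss(2.0, 0.5) for _ in range(200)])
h = normal_reference_bandwidth(data)

for kernel in (gaussian_kernel, epanechnikov_kernel):
    area = integrate(lambda x: kde(x, data, h, kernel), -10.0, 10.0, 0.05)
    print(kernel.__name__, "area under estimate:", round(area, 3))
```

Swapping the kernel changes how each point is smeared out, not what distribution the data is assumed to follow.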

You can also, say, fit a spline to a univariate dataset. We can also call this non-parametric in the sense that the number of knot parameters, etc., can grow with the data size. This doesn't use any probabilistic machinery until you actually 'fit' the spline.

My takeaway from the original post is that you should probably be aware of how things work if you use them, or the defaults might bite you. I like histograms, but I don't like bin-size/position optimization algorithms and just use lots of bins. I like kernel density estimates with the data points lightly shown. In either case you're gonna fool yourself a couple times.

Indeed, but that estimate is likely to be less misleading in most cases than a histogram (which is just a uniform kernel that is always aligned with bin boundaries).
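A tiny sketch of the bin-alignment problem, using a made-up six-point dataset and a hypothetical `histogram_counts` helper: the same data and the same bin width give a peaked or a flat histogram depending only on where the bins start.

```python
import math
from collections import Counter

def histogram_counts(data, origin, width):
    # Bin k covers [origin + k*width, origin + (k+1)*width).
    return Counter(math.floor((x - origin) / width) for x in data)

# Illustrative data only, chosen so points sit just either side of integers.
data = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]

a = histogram_counts(data, origin=0.0, width=1.0)
b = histogram_counts(data, origin=0.5, width=1.0)

counts_a = [a[k] for k in sorted(a)]  # [1, 2, 2, 1]: looks peaked
counts_b = [b[k] for k in sorted(b)]  # [2, 2, 2]: looks flat
print(counts_a, counts_b)
```

A kernel estimate centers a bump on every data point, so it has no bin origin to get wrong, though it still has a bandwidth to choose.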
One particular parameter of the kernel, bandwidth, can result in a highly misleading visualization when its value is chosen arbitrarily. Here is an example: http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth...
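As a rough illustration of how much the bandwidth matters, the sketch below counts the local maxima of a Gaussian kernel estimate on a grid for a very small and a very large bandwidth. The six data points, the grid, the two bandwidth values, and the helper names are all arbitrary choices for the example, not taken from the linked figure.

```python
import math

def gaussian_kde(x, data, h):
    # Gaussian kernel density estimate at a single point x.
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

def count_modes(data, h, lo=-4.0, hi=8.0, step=0.01):
    # Count strict local maxima of the estimate on an evenly spaced grid.
    n = int(round((hi - lo) / step)) + 1
    ys = [gaussian_kde(lo + i * step, data, h) for i in range(n)]
    return sum(1 for i in range(1, n - 1)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])

# Illustrative sample only.
data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

undersmoothed = count_modes(data, h=0.1)   # one spike per data point
oversmoothed = count_modes(data, h=10.0)   # everything blurred together
print(undersmoothed, oversmoothed)
```

With h=0.1 every one of the six points becomes its own mode, while the large bandwidth merges them into a single broad hump. Neither picture says much about the data, which is the danger of an arbitrarily chosen value.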

The smoothing gives unsavvy readers a false sense of accuracy. With a histogram they can at least tell it's an approximation.

Yup. Luckily there are good methods for choosing the bandwidth:

http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optim...