by jfim 4819 days ago
As mentioned, one should really be using a kernel density plot instead of a histogram, except when there are already classes in the data.

In R, one can simply do:

  library("ggplot2")
  library("datasets")
  ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()
which gives a chart like this (http://jean-francois.im/temp/eruptions-kde.png). Contrast with:

  ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)
which gives a chart like this (http://jean-francois.im/temp/eruptions-histogram.png).

Edit: Other plots mentioned in this discussion:

  ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")
Cumulative distribution, as suggested by leot (http://jean-francois.im/temp/eruptions-ecdf.png)

  qqnorm(faithful$eruptions)
Q-Q plot, as suggested by christopheraden (http://jean-francois.im/temp/eruptions-qq.png)

But then you would have to choose a certain kernel and assume the data conforms to that distribution, which isn't always true.
This is really not true.

A histogram is considered (by statisticians) to be a non-parametric density estimator. Kernel density estimation is also considered a non-parametric density estimator.

The kernel function you use does not depend on the distribution of your data. If you have normal data, you can use an equation to provide the 'optimal' bandwidth in that case, but this is about bandwidth selection and not the kernel itself.
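To make the kernel-versus-bandwidth distinction concrete, here is a minimal base-R sketch. It assumes the built-in faithful dataset used earlier in the thread; the three kernels are an arbitrary pick, and bw.nrd0 is simply the rule of thumb that density() uses by default, not a recommendation.

```r
# Same data and same bandwidth, three different kernel shapes.
x  <- faithful$eruptions
bw <- bw.nrd0(x)  # normal-reference rule of thumb (density()'s default)

kernels <- c("gaussian", "epanechnikov", "rectangular")  # illustrative choice
fits <- lapply(kernels, function(k) density(x, bw = bw, kernel = k))
names(fits) <- kernels

# Overlay the estimates; with the bandwidth held fixed, the kernel shape
# mostly changes how smooth the curve looks, not where the mass sits.
plot(fits$gaussian, main = "Same bandwidth, different kernels")
lines(fits$epanechnikov, lty = 2)
lines(fits$rectangular, lty = 3)
rug(x)
legend("topleft", legend = kernels, lty = 1:3)
```

None of these kernels assumes anything about the distribution of the data; only the bandwidth rule has a normal-data motivation.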

You can also, say, fit a spline to a univariate dataset. We can also call this non-parametric in the sense that the number of knot parameters, etc., can grow with the data size. This doesn't use any probabilistic machinery until you actually 'fit' the spline.

My takeaway from the original post is that you should be aware of how things work if you use them, or the defaults might bite you. I like histograms, but I don't like bin-size/position optimization algorithms, so I just use lots of bins. I also like kernel density estimates with the data points lightly shown. In either case you're gonna fool yourself a couple of times.

Indeed, but that estimate is likely to be less misleading in most cases than a histogram (which is just a uniform kernel that is always aligned with bin boundaries).
One particular parameter of the kernel, the bandwidth, can produce a highly misleading visualization if its value is chosen arbitrarily. Here is an example: http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth...

The smoothing gives unsavvy readers a false sense of accuracy. With a histogram they can at least tell it's an approximation.
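A rough way to see the bandwidth effect on the faithful data from earlier in the thread is to count the bumps in the estimate as the bandwidth changes. This is only a sketch: the three adjust values are arbitrary, and count_modes is a hypothetical helper written for this example, not a library function. In ggplot2, the same multiplier can be passed as geom_density(adjust = ...).

```r
x <- faithful$eruptions

# Hypothetical helper: count local maxima of the estimate on density()'s grid.
count_modes <- function(d) sum(diff(sign(diff(d$y))) == -2)

# 'adjust' multiplies the default bandwidth; these values are arbitrary.
adjustments <- c(0.1, 1, 5)
fits <- lapply(adjustments, function(a) density(x, adjust = a))
n_modes <- sapply(fits, count_modes)
names(n_modes) <- paste0("adjust=", adjustments)
print(n_modes)

# Overlay the three curves: undersmoothed, default, oversmoothed.
plot(fits[[1]], main = "Effect of bandwidth", lty = 3)
lines(fits[[2]], lty = 1)
lines(fits[[3]], lty = 2)
rug(x)
```

A very small bandwidth tends to invent bumps, while a very large one can hide real ones, which is why the rug of raw data points is worth keeping on the plot.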

Yup. Luckily there are good methods for choosing the bandwidth:

http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optim...
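The linked software is not reproduced here. As a stand-in, base R ships a few data-driven bandwidth selectors, and this sketch simply compares them on the faithful data from earlier in the thread; which one is "best" depends on the data.

```r
x <- faithful$eruptions

bandwidths <- c(
  nrd0 = bw.nrd0(x),  # Silverman's rule of thumb, density()'s default
  nrd  = bw.nrd(x),   # Scott's variation of the rule of thumb
  SJ   = bw.SJ(x)     # Sheather-Jones plug-in selector
)
print(bandwidths)

# Compare the plug-in estimate with the rule-of-thumb estimate.
plot(density(x, bw = "SJ"), main = "Sheather-Jones vs. rule of thumb")
lines(density(x, bw = "nrd0"), lty = 2)
rug(x)
```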

Big ups on your use of ggplot--best R graphing capabilities around! In response to your update about the QQ Plot, I didn't compare against normality like you did (the article is comparing exponentials, so a normal QQ isn't the best choice). The QQ Plot just compares the quantiles of one distribution to another (could be an ecdf against a hypothesized cdf, or an ecdf against another ecdf...). Essentially, by plotting one set of points against another, I'm suggesting that the empirical distribution of Annie is the same as the empirical distribution of Brian, or any other pairing.
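A two-sample QQ plot of that kind can be sketched with base R's qqplot(). The article's data isn't available in this thread, so annie and brian below are simulated exponential samples used purely as stand-ins; the rates and sample sizes are made up.

```r
set.seed(1)
annie <- rexp(200, rate = 1)    # simulated stand-in, not the article's data
brian <- rexp(200, rate = 1.5)  # simulated stand-in with a different rate

# Plot the sorted values of one sample against the sorted values of the other.
qq <- qqplot(annie, brian,
             xlab = "Annie quantiles", ylab = "Brian quantiles")
abline(0, 1, lty = 2)  # points near this line suggest similar distributions
```

If the two samples came from the same distribution, the points would hug the dashed line; a different slope hints at a difference in scale.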
Good point; I wasn't using the data from the link, but what you mention (doing a QQ plot of distribution pairs to check if they are similar) is probably what I should've posted instead of a QQ plot of some other dataset against an ideal normal distribution.

And as you mention, ggplot is seriously awesome.