| HN Mirror

by jfim 4819 days ago

As mentioned, one should really be using a kernel density plot instead of a histogram, except when there are already classes in the data.

In R, one can simply do:

  library("ggplot2")
  library("datasets")
  ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()

which gives a chart like this (http://jean-francois.im/temp/eruptions-kde.png). Contrast with:

  ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)

which gives a chart like this (http://jean-francois.im/temp/eruptions-histogram.png).

Edit: Other plots mentioned in this discussion:

  ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")

Cumulative distribution, as suggested by leot (http://jean-francois.im/temp/eruptions-ecdf.png)

  qqnorm(faithful$eruptions)

Q-Q plot, as suggested by christopheraden (http://jean-francois.im/temp/eruptions-qq.png)

2 comments

xfs 4819 days ago

But then you would have to choose a certain kernel and assume the data conforms to that distribution, which isn't always true.

stakka 4818 days ago

This is really not true.

A histogram is considered (by statisticians) to be a non-parametric density estimator. Kernel density estimation is also considered a non-parametric density estimator.

The kernel function you use does not depend on the distribution of your data. If you have normal data, you can use an equation to provide the 'optimal' bandwidth in that case, but this is about bandwidth selection and not the kernel itself.
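
To make the split between bandwidth and kernel concrete, here is a minimal sketch in base R, assuming the `faithful` dataset used earlier in the thread. The hand-written formula is Silverman's rule of thumb, which is what `bw.nrd0()` computes; the variable names are only illustrative.

```r
x <- faithful$eruptions
n <- length(x)

# Silverman's rule of thumb: a bandwidth derived under a normality assumption.
# stats::bw.nrd0() computes the same value, and density() uses it by default.
bw_manual  <- 0.9 * min(sd(x), IQR(x) / 1.34) * n^(-1 / 5)
bw_builtin <- bw.nrd0(x)

# The kernel is a separate choice: same bandwidth, two different kernel shapes.
d_gauss <- density(x, bw = bw_builtin, kernel = "gaussian")
d_epan  <- density(x, bw = bw_builtin, kernel = "epanechnikov")

plot(d_gauss, main = "Same bandwidth, different kernels")
lines(d_epan, lty = 2)
rug(x)
```

The normality assumption enters only through the number passed as `bw`; swapping the kernel leaves that number untouched.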

You can also, say, fit a spline to a univariate dataset. We can also call this non-parametric in the sense that the number of knot parameters, etc., can grow with the data size. This doesn't use any probabilistic machinery until you actually 'fit' the spline.

My takeaway from the original post is that you should probably be aware of how things work if you use them, or the defaults might bite you. I like histograms, but I don't like bin-size/position optimization algorithms, so I just use lots of bins. I like kernel density estimates with the data points lightly shown. In either case you're gonna fool yourself a couple of times.
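
One possible reading of "lots of bins" plus "data points lightly shown", sketched in base R on the `faithful` data from the top of the thread. The value `breaks = 40` is an arbitrary choice, and `hist()` treats it as a suggestion rather than an exact bin count.

```r
x <- faithful$eruptions

# Histogram with many narrow bins, drawn on the density scale so that
# the kernel density estimate can be overlaid on the same axes.
h <- hist(x, breaks = 40, freq = FALSE,
          main = "Old Faithful eruptions",
          xlab = "Eruption length (minutes)")

# Kernel density estimate with the default bandwidth.
lines(density(x))

# The raw observations, shown lightly along the axis.
rug(x, col = "grey50")
```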

jfim 4819 days ago

Indeed, but that estimate is likely to be less misleading in most cases than a histogram (which is just a uniform kernel that is always aligned with bin boundaries).
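
The "uniform kernel aligned with bin boundaries" remark can be checked numerically. This is a rough sketch in base R, assuming a bin width of 0.5 and breaks starting at 1.5 (both arbitrary choices that happen to cover the `faithful` eruption times).

```r
x <- faithful$eruptions
n <- length(x)
h <- 0.5                                   # bin width (arbitrary)

breaks   <- seq(1.5, 5.5, by = h)          # spans the range of the data
hist_est <- hist(x, breaks = breaks, plot = FALSE)

# Box (uniform) kernel estimate evaluated at each bin midpoint:
# the fraction of points within h/2 of the midpoint, divided by h.
box_est <- sapply(hist_est$mids, function(m) {
  sum(x > m - h / 2 & x <= m + h / 2) / (n * h)
})

# The histogram heights and the box-kernel values coincide.
same <- isTRUE(all.equal(hist_est$density, box_est))
print(same)

# Shifting the bin origin by half a bin regroups the same data.
shifted <- hist(x, breaks = breaks - h / 2, plot = FALSE)
print(rbind(original = hist_est$counts, shifted = shifted$counts))
```

A kernel density estimate evaluates the kernel at every point on the axis, whereas the histogram only evaluates it at the bin centres, which is why the choice of origin matters for one and not the other.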

xfs 4819 days ago

One particular parameter of the kernel, bandwidth, can result in highly misleading visualization given arbitrarily chosen values. Here is an example http://en.wikipedia.org/wiki/File:Comparison_of_1D_bandwidth...

The smoothing gives unsavvy readers a false sense of accuracy. With a histogram they can at least tell it's an approximation.
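
The effect is easy to reproduce on the `faithful` data from the top of the thread. In this sketch the bandwidths 0.03 and 1.5 are arbitrary values picked only to exaggerate under- and over-smoothing; they are not taken from the linked figure.

```r
x <- faithful$eruptions

# Three bandwidths: far too small, the default rule of thumb, far too large.
d_under   <- density(x, bw = 0.03)
d_default <- density(x)              # uses bw.nrd0(x)
d_over    <- density(x, bw = 1.5)

# Use the widest curve's x-range so that all three fit on one plot.
plot(d_under, col = "grey60", xlim = range(d_over$x),
     main = "Effect of bandwidth", xlab = "Eruption length (minutes)")
lines(d_default, lwd = 2)
lines(d_over, lty = 2)
rug(x)
legend("topright",
       legend = c("bw = 0.03", "default", "bw = 1.5"),
       col = c("grey60", "black", "black"),
       lty = c(1, 1, 2), lwd = c(1, 2, 1))
```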

noelwelsh 4819 days ago

Yup. Luckily there are good methods for choosing the bandwidth:

http://www.umiacs.umd.edu/labs/cvl/pirl/vikas/Software/optim...
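
The link above is truncated, so the following does not reproduce that software. As a sketch of the general idea, these are the bandwidth selectors that ship with base R's `stats` package, applied to the `faithful` data used earlier in the thread. The cross-validation selectors may emit a warning if the minimum falls at the edge of their search range.

```r
x <- faithful$eruptions

# Bandwidth selectors available in base R.
bws <- c(
  nrd0 = bw.nrd0(x),  # Silverman's rule of thumb (the density() default)
  nrd  = bw.nrd(x),   # Scott's variation of the normal-reference rule
  ucv  = bw.ucv(x),   # unbiased (least-squares) cross-validation
  bcv  = bw.bcv(x),   # biased cross-validation
  SJ   = bw.SJ(x)     # Sheather-Jones plug-in
)
print(round(bws, 3))

# density() also accepts a selector by name.
plot(density(x, bw = "SJ"), main = "Sheather-Jones bandwidth")
rug(x)
```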

christopheraden 4818 days ago

Big ups on your use of ggplot--best R graphing capabilities around! In response to your update about the QQ Plot, I didn't compare against normality like you did (the article is comparing exponentials, so a normal QQ isn't the best choice). The QQ Plot just compares the quantiles of one distribution to another (could be an ecdf against a hypothesized cdf, or an ecdf against another ecdf...). Essentially, by plotting one set of points against another, I'm suggesting that the empirical distribution of Annie is the same as the empirical distribution of Brian, or any other pairing.
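
A minimal sketch of both variants in base R. The article's data is not available here, so `annie` and `brian` are simulated exponential samples standing in for it; the sample sizes and rates are made up for illustration.

```r
set.seed(1)

# Hypothetical stand-ins for the two people's measurements.
annie <- rexp(200, rate = 1)
brian <- rexp(150, rate = 1.2)

# Two-sample Q-Q plot: quantiles of one empirical distribution
# plotted against the quantiles of the other.
qq <- qqplot(annie, brian, main = "Annie vs. Brian")
abline(0, 1, lty = 2)  # points near this line suggest similar distributions

# Q-Q plot against a hypothesized exponential instead of a normal,
# with the rate estimated from the sample mean.
n <- length(annie)
plot(qexp(ppoints(n), rate = 1 / mean(annie)), sort(annie),
     xlab = "Theoretical exponential quantiles",
     ylab = "Sample quantiles")
abline(0, 1, lty = 2)
```

When the two samples differ in length, `qqplot()` interpolates the longer one down to the length of the shorter, so the returned `x` and `y` always pair up.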

jfim 4818 days ago

Good point; I wasn't using the data from the link, but what you mention (doing a QQ plot of distribution pairs to check if they are similar) is probably what I should've posted instead of a QQ plot of some other dataset against an ideal normal distribution.

And as you mention, ggplot is seriously awesome.
