Hacker News new | ask | show | jobs
by yummyfajitas 5134 days ago
Ok, redid it with 5% opacity. You are correct, at that level, the density distribution in x is qualitatively visible.

http://i.imgur.com/geKbT.png

I disagree that the hexbin has lower resolution. The color dimension allows the human eye to easily differentiate regions having similar densities (e.g., 70 vs 50). The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.

3 comments

> The difference between deep red and orange is a lot bigger than the difference between dark blue and slightly less dark blue.

That may be true, but is it better? Does the scatterplot underemphasize differen densities, or does the density plot overemphasize them? I think the scatterplot is more intuitive. There are 21 equal steps between 0 and 100% black. Two points are twice as dark as one, four points are twice as dark as two. Darker means more, lighter means less.

Compare that to shifting from blue to red. Does the shift from orange to red indicate the same density difference as the shift from blue to orange? To decide you need to consult the color scale. The scatterplot is intuitive, and requires no scale.

> The hexbin has lower spatial resolution, it's true, but I'd argue that the spatial resolution you get in a scatterplot is illusory. It doesn't reflect the underlying probability distribution, only the particular sample.

The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed. The density plot, in contrast, imposes a probability distribution, which may or may not reflect the true distribution of the population. The larger the bins, the more likely the displayed pattern is 'illusory'.

The spatial resolution of a scatterplot represents empirical reality. Each point corresponds to a single observation, with no probability distribution implied or imposed.

tl;dr; I'm a Bayesian, you are a Frequentist (at least with regards to plotting).

I don't know what this has to do with bayes vs frequentist. I am not arguing that the data do not have a probabilty distribution. I am arguing that it is better to show all the data when possible, rather than an eye-catching but lossy summary.
Just a note, but you're using awfully small hexes and large points there. The two plots approach each other asymptotically, especially if you switch to a monochromatic colormap like some here have argued for.

Histograms tend to work best when the data is well understood, while scatterplots are better for samples from an unknown distribution (incl. lattices, multimodals or even double exponentials).

Try also small samples. When the piecewise uniform prior (on both the data and the intensity) is approximately accurate, histograms are far better, as they guide the eye away from nonexistent patterns. But the bandwidth needs to be judiciously set, and often the data transformed.

Clustering is hard to automate.

I have 20-70 data points per hex, at least in the high density regions. While I could have used bigger hexes, I think the difference between a hexbin and a scatterplot is well illustrated.
Fair enough. But your point markers are almost half the size of the hex. That makes color the principal difference.
(It also helps to set edgecolor='none' to remove the black line around each point.)