| HN Mirror


	by pocketsand 727 days ago
	Why? They’re non-parametric and make zero assumptions of normality.

1 comments

blueflow 727 days ago

How else would you calculate the quartiles to render the boxes?

munch117 727 days ago

Count data points in each quartile. You can do that for any sortable data, independent of distribution.

blueflow 727 days ago

On second thought, this method makes the outer brackets / whiskers pretty much useless, since their position is determined by the largest outliers, which are more or less random.

Falkon1313 726 days ago

That's not how they're drawn. Outliers (More than 1.5 times the interquartile range outside the 1st/3rd quartile) are plotted as dots beyond the whiskers. The whiskers go at Q1-1.5×IQR and Q3+1.5×IQR.
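
As a rough illustration of the 1.5×IQR rule described here, a minimal Python sketch follows. The function name `box_stats`, the sample data, and the choice of the "inclusive" quantile method are assumptions made for illustration. It also assumes the common convention in which each whisker is drawn at the most extreme data point that still lies inside the fence (Q1-1.5×IQR or Q3+1.5×IQR), rather than at the fence itself; conventions differ between tools.

```python
import statistics

def box_stats(values, k=1.5):
    """Compute box-plot statistics using the k * IQR fence rule.

    Hypothetical helper for illustration; not taken from any plotting library.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    fence_low = q1 - k * iqr
    fence_high = q3 + k * iqr
    # Points inside the fences determine where the whiskers end.
    inside = [x for x in data if fence_low <= x <= fence_high]
    # Points outside the fences are plotted individually as outliers.
    outliers = [x for x in data if x < fence_low or x > fence_high]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "fence_low": fence_low,
        "fence_high": fence_high,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

stats = box_stats([2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this sample the value 30 lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at 9 instead of stretching to the maximum.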

blueflow 726 days ago

Better is! Look at what I was replying to.

blueflow 727 days ago

If you do that in your paper, you had better state next to the graph that you did so.

munch117 727 days ago

Perhaps I expressed myself poorly and left room for misunderstanding, because I cannot possibly imagine that we have any real disagreement on how to compute quartiles.

Any set of numbers I give you, you can compute quartiles for it. There is no algorithm for doing that that breaks down if the numbers don't follow a normal distribution.
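
A minimal sketch of that point in Python, using only the standard library. The helper name `quartiles`, the sample data, and the "inclusive" method are illustrative assumptions; the input is deliberately skewed to show that nothing in the calculation depends on the shape of the distribution.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) for any list of numbers.

    Hypothetical helper: it only sorts the data and reads off cut points,
    with no distributional assumption.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3

# Heavily right-skewed sample, clearly not normal.
skewed = [1, 1, 2, 2, 3, 4, 8, 20, 100]
print(quartiles(skewed))
```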

blueflow 727 days ago

Look at this SVG from Wikipedia: https://upload.wikimedia.org/wikipedia/commons/1/1a/Boxplot_...

When you calculate the box plot using normal distribution parameters, the outliers are outside the outer bracket.

If you split the dataset into 4 equal parts, the bracket will be larger because the outliers are still inside it.

The methodologies are not equal.

This thread is the first time I have heard of people doing the "split dataset into 4 quarters" thing and using that for box plots.

ColFrancis 727 days ago

For what it's worth, you've convinced me that my beloved box plots need to be explained if I want to use them again.

The SVG you've provided clearly shows that the box plot splits the data in 4. The interquartile range (IQR) is clearly marked and it even has a comparison for what the standard deviation (variance) measure would be.

Secondly, if the data truly came from a normal distribution, there are no outliers. Outliers are data points which cannot be explained by the model and need to be removed. Unless you have a good reason to exclude the data points, they should be included. This is why I like the IQR and the median: they are not swayed by a few wide-valued data points. The 1.5*IQR rejection filter is, I think, lazy and unjustified. Happy to discuss this point further, as it is a bugbear of mine.

pocketsand 727 days ago

As I'm sure you know, there are a lot of variations in how quantiles are calculated in various software. The 25th percentile, e.g., doesn't always line up with a value in the dataset, so sometimes nearest-rank methods are used, and otherwise a linearly interpolated data point, where the interpolation is done in various ways.
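
To make the difference concrete, here is a small Python sketch of two such variants. The function names `nearest_rank` and `linear_interp` and the sample data are hypothetical, and the interpolation rule shown (position `(n - 1) * p / 100` in the sorted data) is just one common choice among several.

```python
import math

def nearest_rank(values, p):
    """Percentile by nearest rank: always returns an actual data point."""
    data = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(data)))  # 1-based rank
    return data[rank - 1]

def linear_interp(values, p):
    """Percentile by linear interpolation between neighbouring data points."""
    data = sorted(values)
    h = (len(data) - 1) * p / 100        # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(data) - 1)      # clamp at the last element
    return data[lo] + (h - lo) * (data[hi] - data[lo])

sample = [10, 20, 30, 40]
print(nearest_rank(sample, 25))   # a value taken from the data
print(linear_interp(sample, 25))  # a value between two data points
```

Neither function refers to a mean, a standard deviation, or any normal CDF; both only sort the data and pick or blend positions.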

In any event, none of these methods assume normality, or rely on CDFs of a normal curve.

If they did, every box plot would be symmetric.

The fact that some people think boxplots are constructed in such a way is a pretty good reason to take the author's article seriously on how confusing boxplots are.
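
The symmetry argument above can be checked with a short Python sketch. The helper names and the sample data are assumptions for illustration: one function reads quartiles off the sorted data, the other derives them from a normal distribution fitted to the same data, which is the hypothetical "normal parameters" construction discussed in this thread.

```python
import statistics

def empirical_quartiles(values):
    """Quartiles read directly from the sorted data."""
    return tuple(statistics.quantiles(values, n=4, method="inclusive"))

def normal_quartiles(values):
    """Quartiles of a normal distribution fitted to the data (illustrative only)."""
    fitted = statistics.NormalDist.from_samples(values)
    return tuple(fitted.inv_cdf(p) for p in (0.25, 0.5, 0.75))

skewed = [1, 1, 2, 2, 3, 4, 8, 20, 100]

e_q1, e_med, e_q3 = empirical_quartiles(skewed)
n_q1, n_med, n_q3 = normal_quartiles(skewed)

# Empirical box: the two halves have different widths for skewed data.
print(e_med - e_q1, e_q3 - e_med)
# Normal-derived box: the two halves are equal by construction.
print(n_med - n_q1, n_q3 - n_med)
```

A box built from fitted normal parameters is always symmetric around its centre, whereas real box plots of skewed data are not, which is consistent with quartiles being computed from the data itself.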

thaumasiotes 727 days ago

Arguing that nobody who might be professionally expected to look at a box plot can be reasonably expected to understand how box plots are defined doesn't make a compelling case that using them is a good idea.

A4ET8a8uTh0 727 days ago

It is actually a fascinating argument that shows how little of what is being decided is based on actual data (or at least our understanding of it), and how much data visualization is instead being used to push already pre-approved decisions, with the data serving merely as a 'for' argument.

I agree that if there is an indication that most professionals don't really know what a boxplot is supposed to communicate, maybe it should not be used.

blueflow 727 days ago

If the method used to calculate the boxes is not clear (this thread references at least two different methods), you'll need to state explicitly which method you used.

thaumasiotes 726 days ago

> this thread references at least two different methods

No, as the sidethread comment notes, there is only one way you can compute quartiles. You seem to be arguing that the correct thing to do is to impute them, and that calculating them is such a deviant practice that it would need to be specially remarked on.
