by bluemanshoe 5367 days ago
I would use the Kolmogorov-Smirnov test: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

It's in scipy.stats.ks_2samp

Results:

Sets | D     | p-value
-----+-------+--------
A,C  | 0.275 | 0.080
B,C  | 0.175 | 0.531
-----+-------+--------
A,D  | 0.125 | 0.893
B,D  | 0.275 | 0.080
-----+-------+--------
A,E  | 0.100 | 0.983
B,E  | 0.300 | 0.043
-----+-------+--------
A,F  | 0.300 | 0.043
B,F  | 0.100 | 0.983
-----+-------+--------

As far as the test goes, if D is small and p is high, you cannot reject the hypothesis that the two datasets came from the same distribution. The p-value is roughly how often, by chance alone, you would see a discrepancy at least this large, assuming the null hypothesis (in this case, that both sets are drawn from the same distribution).
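For anyone who wants to see what the test is doing, here is a minimal pure-Python sketch. D is the largest gap between the two empirical CDFs, and the p-value uses the standard large-sample (Kolmogorov series) approximation with a small-sample correction. The function names are mine, and n=40 per set is taken from a later comment in this thread. That this is the same formula scipy.stats.ks_2samp used for the table is an assumption, although it appears to land very close to the values quoted there.

```python
import bisect
import math


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs (the D column)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue_asymptotic(d, n, m):
    """Approximate two-sided p-value for a two-sample KS statistic d."""
    en = math.sqrt(n * m / (n + m))        # square root of the effective sample size
    lam = (en + 0.12 + 0.11 / en) * d      # small-sample correction (an assumption)
    if lam < 0.2:
        return 1.0                         # the series is numerically 1 here
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    # D values from the table, with an assumed n = 40 in every set.
    for d in (0.100, 0.125, 0.175, 0.275, 0.300):
        print(f"D={d:.3f}  p~{ks_pvalue_asymptotic(d, 40, 40):.3f}")
```

With real data you would call ks_statistic(A, C) and pass the result to ks_pvalue_asymptotic, or just use scipy.stats.ks_2samp, which returns both numbers.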

In light of this evidence, if they are not lying to us and each of these sets really came from an A-like or B-like distribution, I'd say fairly confidently that:

F is B-like

E is A-like

D is A-like and

C is B-like (though with lower confidence)

The box-plot: http://i.imgur.com/epPw7.png seems to confirm.

2 comments

You could probably get away with just using a 2-sample t-test (http://en.wikipedia.org/wiki/Students_t-test#Equal_sample_si...), no? No sense using a non-parametric sledgehammer unless absolutely necessary :).

Actually, just looking at the variance of the columns tells the same story as you've discovered above:

> var(data)
        A         B         C         D         E         F
117.19610  20.49239  33.13114  90.62195 115.39044  27.34298

A, D, E are in the same group, B, C, F are in the same group.
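The grouping can be checked mechanically from the variances quoted above. This is only a sketch: comparing on a log scale (ratios rather than absolute gaps) is my choice, not something the comment specifies, and the helper name is made up.

```python
import math

# Sample variances as quoted in the comment above.
variances = {
    "A": 117.19610, "B": 20.49239, "C": 33.13114,
    "D": 90.62195, "E": 115.39044, "F": 27.34298,
}


def nearest_reference(name, refs=("A", "B")):
    """Return whichever reference set has the closest variance, by log-ratio."""
    return min(refs, key=lambda r: abs(math.log(variances[name] / variances[r])))


groups = {name: nearest_reference(name) for name in "CDEF"}
print(groups)  # {'C': 'B', 'D': 'A', 'E': 'A', 'F': 'B'}
```

This only says which reference variance is nearer. It does not say how confident you should be in each assignment.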

Based on the final selections:

there is a 98.3% chance that F is B-like
there is a 98.3% chance that E is A-like
there is an 89.3% chance that D is A-like
there is a 53.1% chance that C is B-like

If these are multiplied together, it appears that there is only a 45.8% chance that they are all classified correctly?

You're asking a good question -- but you know a lot more than what you write above.

The main thing is, you know that C, D, E, and F came from either A or B. The p-values above don't account for that; they just say what's the chance, due to random fluctuation, that a sample could have come from the same source as A.

That's reflected in the fact that the pairs of p-values don't add to one! (Like (A,C) and (B,C) in the table above.)

You also implicitly know that at least one of {C,D,E,F} is A-like and one is B-like (otherwise there would not be a problem). So even if you know P(X and Y have same source) for all (X,Y), which you don't, you couldn't multiply them.
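To make the contrast with p-values concrete, here is a rough sketch of a calculation that does use the "either A or B" knowledge. It compares how well each source explains a sample and normalizes, so the two numbers add to one. Everything in it is an assumption for illustration: the Gaussian model, the equal priors, the synthetic data, and the function names. It also treats each unknown set on its own, so it still ignores the constraint that at least one set is A-like and one is B-like.

```python
import math
import random
import statistics


def gaussian_loglik(sample, mu, sigma):
    """Log-likelihood of the sample under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in sample)


def posterior_two_sources(sample, ref_a, ref_b, prior_a=0.5):
    """P(source is A) and P(source is B), given it must be one of the two."""
    log_a = math.log(prior_a) + gaussian_loglik(
        sample, statistics.mean(ref_a), statistics.stdev(ref_a))
    log_b = math.log(1 - prior_a) + gaussian_loglik(
        sample, statistics.mean(ref_b), statistics.stdev(ref_b))
    top = max(log_a, log_b)                 # subtract the max to avoid underflow
    w_a, w_b = math.exp(log_a - top), math.exp(log_b - top)
    return {"A": w_a / (w_a + w_b), "B": w_b / (w_a + w_b)}


if __name__ == "__main__":
    rng = random.Random(0)                  # synthetic stand-ins, not the real data
    ref_a = [rng.gauss(0, 10.8) for _ in range(40)]
    ref_b = [rng.gauss(0, 4.5) for _ in range(40)]
    unknown = [rng.gauss(0, 4.5) for _ in range(40)]
    print(posterior_two_sources(unknown, ref_a, ref_b))
```

Unlike the pairs of p-values in the table, the two outputs here always sum to one. That normalization is the piece the p-values are missing.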

Finally, the p-value returned by the KS test will underestimate the true probability of discrepancy. This is because it's only looking at one thing, the max value of a CDF difference. The significant differences between the distributions may lie elsewhere, like in the tails, and the KS test is known to be relatively insensitive to tail behavior. (Although at n=40 you won't be able to see far into the tails.)

There are a host of other tests that use the same idea (empirical CDF difference) but weight differently. Some can be more effective than the KS test if you're looking for certain types of difference. Here's an OK overview, albeit for the goal of assessing normality:

http://www.instatmy.org.my/downloads/e-jurnal%202/3.pdf

In a real problem, it's always a good idea to use more empirical-cdf tests than just the KS test, to compare variances and other moments as some people in the thread have done, and to make histogram or CDF plots -- especially if you're in just 1 dimension and the plots are easy to interpret.