by throwaway287391
500 days ago
Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 with a p-value of 0.00001864, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or, is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
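For anyone who wants to sanity-check this kind of claim, here is a minimal sketch of a one-sided test of a 0.56 win rate against 0.5. It assumes n=1000 independent ratings (a hypothetical count, since the real number isn't stated) and that 560 of them favored the model. The function names are made up for illustration. The exact p-value depends on the test used (one-sided vs two-sided, exact binomial vs normal approximation), so it will not necessarily match any single quoted figure.

```python
import math


def normal_approx_p_value(wins, n, p0=0.5):
    """One-sided p-value for H0: win rate == p0, via the normal approximation.

    Returns (p_value, z).
    """
    se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    z = (wins / n - p0) / se               # how many sigma above p0
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail area of N(0, 1)
    return p_value, z


def exact_binomial_p_value(wins, n, p0=0.5):
    """One-sided exact p-value: P(X >= wins) for X ~ Binomial(n, p0).

    Works in log space so large n does not overflow or underflow.
    """
    total = 0.0
    for k in range(wins, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p0) + (n - k) * math.log(1 - p0)
        )
        total += math.exp(log_pmf)
    return total


# Assumed numbers: 560 of 1000 ratings preferred the model.
p_norm, z = normal_approx_p_value(560, 1000)
p_exact = exact_binomial_p_value(560, 1000)
print(f"z = {z:.2f}, normal-approx p = {p_norm:.2e}, exact p = {p_exact:.2e}")
```

None of this says anything about blinding or order randomization; it only checks the arithmetic under the stated assumptions.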
Note that the Non-STEM result is still within error. The STEM result differs by less than 2 sigma, so our confidence still shouldn't be that high.