| HN Mirror



	by throwaway287391 500 days ago
	Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 with a p-value of 0.00001864, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or, is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
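For anyone who wants to check this kind of claim themselves, here is a minimal sketch of a significance test for a 56% win rate over 1000 ratings against a 50/50 null. The function names are made up for illustration, and the count of 560 wins is just 0.56 × 1000 from the comment, not a figure from the study. The result depends on the test chosen (one- vs two-sided, exact binomial vs normal approximation), so it will not necessarily reproduce the figure quoted above.

```python
from math import comb, erfc, sqrt

def exact_p_one_sided(wins: int, n: int) -> float:
    """P(X >= wins) for X ~ Binomial(n, 0.5), computed with exact integers."""
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2**n

def normal_p_one_sided(wins: int, n: int) -> float:
    """Normal approximation to the same tail (no continuity correction)."""
    z = (wins / n - 0.5) / sqrt(0.25 / n)
    return 0.5 * erfc(z / sqrt(2))

# Hypothetical counts: 560 wins out of 1000 ratings, i.e. a 0.56 win rate.
wins, n = 560, 1000
print("exact one-sided :", exact_p_one_sided(wins, n))
print("exact two-sided :", 2 * exact_p_one_sided(wins, n))  # symmetric null
print("normal one-sided:", normal_p_one_sided(wins, n))
```

Whichever variant is used, the p-value only says the click rate differs from 50%. It assumes the ratings are independent and that raters were blinded and the order randomized, as stated above.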

2 comments

godelski 500 days ago

  > If so, where do they indicate they failed to randomize/blind the raters?

  > Win rate if user is under time constraint

This is hard to read tbh. Is it STEM? Non-STEM? If it is STEM then this shows there is a bias. If it is Non-STEM then this shows a bias. If it is a mix, well we can't know anything without understanding the split.

Note that Non-STEM is still within error. STEM is less than 2 sigma variance, so our confidence still shouldn't be that high.


n2d4 500 days ago

Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".

If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.
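A rough sketch of that effect, under assumptions that are purely hypothetical: response A always arrives faster, 10% of raters always click the faster response, and the remaining 90% choose by genuine preference, which is set to exactly 50/50 here. The function names and parameters are invented for the illustration and say nothing about how the actual study was run.

```python
import random

def expected_win_rate(true_pref: float, speed_clickers: float) -> float:
    """Expected share of clicks for A when A is always the faster response.

    true_pref      -- share of attentive raters who genuinely prefer A
    speed_clickers -- share of raters who click whichever response came first
    """
    return (1 - speed_clickers) * true_pref + speed_clickers * 1.0

def simulate_win_rate(n: int, true_pref: float, speed_clickers: float,
                      seed: int = 0) -> float:
    """Simulate n ratings and return the observed share of clicks for A."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        if rng.random() < speed_clickers:
            wins += 1                      # speed clicker: always picks A
        elif rng.random() < true_pref:
            wins += 1                      # attentive rater who prefers A
    return wins / n

# No real preference (0.5), but 10% of raters click on speed alone.
print(expected_win_rate(0.5, 0.1))
print(simulate_win_rate(100_000, 0.5, 0.1))
```

Under these assumptions the observed win rate settles near 55% even though nobody actually prefers A. A larger sample only makes that biased estimate more precise; it does not remove the bias.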


throwaway287391 500 days ago

Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.


n2d4 500 days ago

I would recommend you read the comment that started this thread then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294


throwaway287391 500 days ago

I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out. I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.


radlad 500 days ago

> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.


n2d4 500 days ago

Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".
