| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by n2d4 500 days ago
	Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred". If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.

2 comments

throwaway287391 500 days ago

Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.

link

n2d4 500 days ago

I would recommend you read the comment that started this thread then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294

link

throwaway287391 500 days ago

I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out, I now suspect that is responsible for a large part of the disagreement between "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" vs "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.

link

radlad 500 days ago

> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.

link

n2d4 500 days ago

Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".

link