by jphoward 956 days ago
The problem is that the discussed results compare proportions over a relatively small number of questions: 67. If you model this as a binomial distribution, then the 62/67 (92.5%) that GPT4-turbo got gives a 95% confidence interval for the 'true' performance of 83.4% to 97.5%, i.e. it comfortably includes the proportion that GPT4 achieved (64/67 = 95.5%).

I think the evidence from these tests is not strong enough to draw conclusions from.
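
For anyone who wants to check an interval like this themselves, here is a minimal sketch using only the Python standard library. It assumes the exact (Clopper-Pearson) binomial interval, which is one common choice; the comment above does not say which method was used, and other methods (Wilson, normal approximation) give slightly different bounds. The function names are made up for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion.

    Assumption: Clopper-Pearson is the interval of interest; the bounds
    are found by bisection instead of a beta quantile function.
    """
    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(62, 67)
print(f"62/67 = {62 / 67:.1%}, 95% CI: {lower:.1%} to {upper:.1%}")
print("64/67 inside the interval:", lower <= 64 / 67 <= upper)
```

The last line checks whether the other model's observed score falls inside the interval, which is the comparison being made above.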

3 comments

Yes. I see people make this mistake time and again when evaluating LLMs. For a proper comparison, it's not enough to simply throw fewer than a hundred questions at a model and point to a single-digit difference. Not to mention that LLMs have some inherent randomness, so even if you passed the exact same tasks to the same model you would expect some variance.

I see a lot of room for improvement in how we apply statistics to understanding LLM performance.
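
As a rough illustration of what a more careful comparison could look like, here is a sketch of a two-sided Fisher exact test on the two scores quoted in the parent comment (62/67 vs 64/67), using only the standard library. It assumes the two results can be treated as independent samples; if both models answered the same 67 questions, a paired test would be more appropriate, but that needs per-question results that aren't given here. The function name is made up for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two models, columns are (correct, incorrect).
    Assumption: the two rows are independent samples.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(
        prob(x) for x in range(x_min, x_max + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)

p_value = fisher_exact_two_sided(62, 5, 64, 3)
print(f"p-value for 62/67 vs 64/67: {p_value:.2f}")
```

A large p-value here would mean the data can't distinguish the two models, not that they perform the same.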

I’m not surprised; most people can’t even tell the median from the mean.

> I think the evidence from these tests is not strong enough to draw conclusions from.

I used gpt4 turbo for some coding problems yesterday. It was worse. That's enough to draw conclusions for me.