by jphoward
956 days ago
The problem is that the results under discussion compare proportions over a relatively small sample: 67 questions. If you model this as a binomial distribution, then the 62/67 (92.5%) that GPT4-turbo got gives a 95% confidence interval for the 'true' performance of 83.4% to 97.5%, i.e. it comfortably includes the proportion that GPT4 achieved (64/67 = 95.5%). I think the evidence from these tests is not strong enough to draw conclusions from.

I see a lot of room for improvement in how we apply statistics to understanding LLM performance.
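
For anyone who wants to check the interval, here is a minimal sketch in plain Python. The comment above doesn't say which binomial interval was used, so this assumes the exact (Clopper-Pearson) one, which should land close to the quoted figures. Other methods such as Wilson give a somewhat narrower range. The function names are mine, purely for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_decreasing(f, lo=0.0, hi=1.0, iters=60):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials.

    Assumption: this is the Clopper-Pearson interval, which may or may not be
    the method behind the numbers quoted in the comment.
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    if k == 0:
        lower = 0.0
    else:
        lower = bisect_decreasing(
            lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p))
        )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    if k == n:
        upper = 1.0
    else:
        upper = bisect_decreasing(lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


low, high = clopper_pearson(62, 67)
print(f"62/67 = {62 / 67:.1%}, 95% CI: {low:.1%} to {high:.1%}")
print(f"64/67 = {64 / 67:.1%} inside interval: {low <= 64 / 67 <= high}")
```

The point is the second line of output: with only 67 questions, the other model's score sits inside the interval, so the two results can't be told apart on this evidence alone.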