by epups 954 days ago
Yes. I see people make this mistake time and again when evaluating LLMs. For a proper comparison, it's not enough to throw fewer than a hundred questions at a model and point to a single-digit difference in scores. On top of that, LLM output has some inherent randomness, so even if you passed the exact same tasks to the same model twice, you would expect some variance.
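
To put rough numbers on it, here's a minimal sketch of a 95% confidence interval for the accuracy gap between two models. The scores (75 vs. 70 correct out of 100) and the function name `accuracy_diff_interval` are made up for illustration, and it uses a plain normal approximation that treats the two runs as independent samples. If both models answer the same questions, a paired test such as McNemar's would be the better fit.

```python
import math

def accuracy_diff_interval(correct_a, correct_b, n, z=1.96):
    """Approximate 95% CI for accuracy(A) - accuracy(B), n questions each.

    Assumes independent samples and a normal approximation; this is an
    illustrative sketch, not a full evaluation protocol.
    """
    p_a = correct_a / n
    p_b = correct_b / n
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical scores: a 5-point gap on 100 questions.
low, high = accuracy_diff_interval(75, 70, 100)
print(f"n=100:   {low:+.3f} to {high:+.3f}")  # interval spans zero

# Same accuracies on 10,000 questions.
low, high = accuracy_diff_interval(7500, 7000, 10000)
print(f"n=10000: {low:+.3f} to {high:+.3f}")  # interval excludes zero
```

With 100 questions the interval comfortably includes zero, so a 5-point gap tells you very little. The same gap only becomes clearly distinguishable from noise at a much larger sample size, and that's before accounting for run-to-run sampling variance.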

I see a lot of room for improvement in how we apply statistics to understanding LLM performance.