by epups
954 days ago
Yes. I see people make this mistake time and again when evaluating LLMs. For a proper comparison, it's not enough to throw fewer than a hundred questions at each model and point to a single-digit difference in scores. On top of that, LLMs have some inherent randomness, so even if you passed the exact same tasks to the same model twice you would expect some variance. I see a lot of room for improvement in how we apply statistics to understanding LLM performance.
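To put rough numbers on this, here is a minimal sketch of a 95% confidence interval for the accuracy gap between two models. The function name `accuracy_diff_interval` and the scores (78 vs. 72 correct out of 90) are made up for illustration, and it assumes a plain normal approximation that treats the two runs as independent samples. A paired test on the shared questions would be the more careful choice, and this ignores run-to-run sampling variance entirely.

```python
import math

def accuracy_diff_interval(correct_a, correct_b, n, z=1.96):
    """Approximate confidence interval for accuracy(A) - accuracy(B).

    Assumes each model answered n questions and that the two sets of
    results are independent (a simplification; hypothetical helper).
    z=1.96 corresponds to roughly 95% coverage.
    """
    p_a = correct_a / n
    p_b = correct_b / n
    # Standard error of the difference between two binomial proportions.
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_a - p_b
    return diff - z * se, diff + z * se

# Hypothetical scores: a 6-question gap on a 90-question benchmark.
low, high = accuracy_diff_interval(correct_a=78, correct_b=72, n=90)
print(f"95% CI for the accuracy difference: [{low:+.3f}, {high:+.3f}]")
```

With these made-up scores the interval spans zero, so the gap by itself doesn't tell you which model is better.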