| HN Mirror



	by epups 954 days ago
	Yes. I see people make this mistake time and again when evaluating LLMs. For a proper comparison, it's not enough to throw fewer than a hundred questions at a model and point to a single-digit difference in score. Not to mention that LLMs have some inherent randomness, so even if you passed the exact same tasks to the same model twice, you would expect some variance between runs. I see a lot of room for improvement in how we apply statistics to understanding LLM performance.
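
	To put a rough number on that, here is a minimal sketch of a 95% confidence interval for the accuracy gap between two models on a 100-question benchmark. The scores (70/100 and 75/100) and the function name are made up for illustration, and it uses the plain normal approximation for two independent proportions, which is only one of several reasonable choices.

```python
import math

def diff_confidence_interval(correct_a, correct_b, n, z=1.96):
    """Approximate CI for the accuracy difference (B minus A).

    Uses the normal approximation for two independent proportions;
    z=1.96 corresponds to a 95% interval.
    """
    p_a = correct_a / n
    p_b = correct_b / n
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical scores: model A gets 70/100, model B gets 75/100.
low, high = diff_confidence_interval(70, 75, 100)
print(f"95% CI for the difference: [{low:.3f}, {high:.3f}]")
```

	With these made-up numbers the interval runs from roughly -0.07 to +0.17, so it comfortably includes zero: a 5-point gap on 100 questions does not separate the two models. If both models answer the same questions, a paired test would be a better fit, and none of this accounts for the run-to-run randomness mentioned above.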