| HN Mirror

	by maeil 714 days ago
	Not really. There are a hundred benchmarks, but they all suffer from the same issues: they're rated by other LLMs, and the tasks are often too simple and too similar to each other. The hope is that gathering enough of these benchmarks gives you a representative test suite, but in my view we're still pretty far off.