| HN Mirror



	by vlovich123 188 days ago
	One classic problem in all of ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark. This remains an open problem for LLMs: we don’t have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it’s delivered huge gains across the board that don’t come from overfitting to benchmarks.

1 comments

ipaddr 188 days ago

This could be a solved problem. Come up with problems that are not online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult.
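
A minimal sketch of the "compare" step, assuming you already have graded answers for both public problems and private ones that were never posted online. The `Result` type, the `contamination_gap` helper, and the sample data are all hypothetical, made up for illustration rather than taken from any real benchmark.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One graded answer; is_private marks problems never published online."""
    problem_id: str
    is_private: bool
    correct: bool


def accuracy(results):
    """Fraction of correct answers, or None when there are no results."""
    if not results:
        return None
    return sum(r.correct for r in results) / len(results)


def contamination_gap(results):
    """Public accuracy minus private accuracy.

    A large positive gap hints (but does not prove) that the model has
    memorized public problems rather than improved in general.
    """
    public = accuracy([r for r in results if not r.is_private])
    private = accuracy([r for r in results if r.is_private])
    if public is None or private is None:
        raise ValueError("need at least one public and one private result")
    return public - private


# Hypothetical graded results, invented purely for illustration.
results = [
    Result("pub-1", False, True),
    Result("pub-2", False, True),
    Result("pub-3", False, True),
    Result("pub-4", False, False),
    Result("priv-1", True, True),
    Result("priv-2", True, False),
    Result("priv-3", True, False),
    Result("priv-4", True, False),
]
gap = contamination_gap(results)
print(f"public minus private accuracy: {gap:.2f}")
```

With so few problems the gap is mostly noise, so a real comparison would need many more private problems of similar difficulty to the public ones.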


vlovich123 188 days ago

Hard to do for an industry benchmark, since running the test this way requires sending the questions to the LLM, which then basically puts them into a public training set.

This has been tried multiple times by multiple people, and it ends up not holding up well over time in terms of retaining immunity to “cheating”.


kalkin 188 days ago

How do you imagine existing benchmarks were created?
