


	by flashdesk 56 days ago
	I like this kind of benchmark, especially since it uses problems that are harder to overfit to. That said, single-attempt results are hard to interpret on their own. For anything code-like, retries, test feedback, or just letting the model iterate tend to change the outcome quite a bit.
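
To make the single-attempt vs. iterate point concrete, here is a minimal sketch of a retry loop that feeds test output back into the next attempt. Everything in it is hypothetical and for illustration only: `solve_with_retries`, `toy_generate`, and `toy_run_tests` are made-up names, and the toy "model" is rigged to succeed only after it has seen a failure message. It is not how any particular benchmark or harness works.

```python
from typing import Callable, Optional, Tuple


def solve_with_retries(
    generate: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    problem: str,
    max_attempts: int = 1,
) -> Tuple[bool, int]:
    """Try up to max_attempts times, passing test feedback to each retry.

    Returns (solved, attempts_used). With max_attempts=1 this is the
    single-attempt setting; larger values let the model iterate.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(problem, feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            return True, attempt
    return False, max_attempts


# Hypothetical stand-in for a model: wrong on the first try,
# right once it has seen any test feedback.
def toy_generate(problem: str, feedback: Optional[str]) -> str:
    return "a + b" if feedback else "a - b"


# Hypothetical stand-in for a test runner: returns (passed, message).
def toy_run_tests(candidate: str) -> Tuple[bool, str]:
    if candidate == "a + b":
        return True, ""
    return False, "expected 5, got -1"


if __name__ == "__main__":
    print(solve_with_retries(toy_generate, toy_run_tests, "add two numbers", 1))
    print(solve_with_retries(toy_generate, toy_run_tests, "add two numbers", 3))
```

The same toy "model" scores as a failure with one attempt and a success with three, which is roughly why a single-attempt number by itself doesn't say much about how the model behaves in an iterative setup.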