by flashdesk 56 days ago
I like this kind of benchmark, especially since it uses problems that are harder to overfit to.

That said, single-attempt results are hard to interpret on their own. For code-like tasks, retries, test feedback, or simply letting the model iterate tend to change the outcome quite a bit.
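
To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes attempts are independent with a fixed per-attempt success rate p, which is a simplification: retries that see test feedback are not independent and would likely do better. The function name `pass_at_k` and the 0.3 success rate are made up for illustration, not taken from the benchmark.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds.

    Assumption: every attempt succeeds with the same probability p,
    independently of the others (no learning from test feedback).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return 1.0 - (1.0 - p) ** k


# Hypothetical single-attempt success rate of 30%.
for k in (1, 3, 5):
    print(k, round(pass_at_k(0.3, k), 3))
# prints:
# 1 0.3
# 3 0.657
# 5 0.832
```

Even under this crude model, a 30% single-shot score turns into roughly 66% with three tries, so two models that look close on one attempt can end up far apart once they're allowed to iterate.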