by flashdesk
56 days ago
I like this kind of benchmark, especially since it uses problems that are harder to overfit to. That said, single-attempt results are hard to interpret on their own. For anything code-like, retries, test feedback, or simply letting the model iterate tend to change the outcome quite a bit.
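
To put a rough number on the retry effect, here is a minimal sketch. It assumes every attempt succeeds independently with the same probability `p`, which is a simplification: retries driven by test feedback are not independent, so real gains may be larger or smaller. The function name `solve_rate_with_retries` and the 0.3 single-attempt rate are made up for illustration and are not taken from the benchmark.

```python
def solve_rate_with_retries(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes (hypothetically) that attempts are independent and each
    succeeds with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    single_attempt_rate = 0.3  # illustrative value, not a measured result
    for k in (1, 3, 5, 10):
        rate = solve_rate_with_retries(single_attempt_rate, k)
        print(f"attempts={k:2d}  solve rate={rate:.3f}")
```

Even under this crude model, a problem that looks mostly unsolved on a single attempt can look mostly solved after a handful of tries, which is why a single-attempt score is hard to compare with an iterate-until-tests-pass setup.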