Hacker News new | ask | show | jobs
by nimar 329 days ago
v interesting benchmark, looking forward to see it evolve over time. actually surprisingly good results already.

maybe add a couple harder APIs (or more complex queries) as well where current models overwhelmingly fail?

that way, we can still measure models in a couple of years against the current ones.

also adding o3 and for reference the model(s) used by superglue in this benchmark would be interesting.