by nimar 329 days ago

very interesting benchmark, looking forward to seeing it evolve over time. the results are already surprisingly good.

maybe also add a couple of harder APIs (or more complex queries) where current models overwhelmingly fail?

that way, we can still measure models in a couple of years against the current ones.

it would also be interesting to add o3 and, for reference, the model(s) that superglue uses in this benchmark.