by nimar
329 days ago
Very interesting benchmark, looking forward to seeing it evolve over time. The results are actually surprisingly good already. Maybe add a couple of harder APIs (or more complex queries) as well, where current models overwhelmingly fail? That way, we can still measure models in a couple of years against the current ones. Also, adding o3 and, for reference, the model(s) used by superglue in this benchmark would be interesting.