by sigmoid10
310 days ago
Benchmarks that execute code are, to some degree, the only kind where you can automate testing at scale without humans in the loop, but even that has its caveats [1]. Regardless, when your output is natural-language text (as it is in this case), there is simply no viable alternative for measuring accuracy economically. There is frankly no argument to be had here, because this is simply not achievable with current technology.

[1] https://openai.com/index/introducing-swe-bench-verified/