Hacker News
by spgorbatiuk 9 hours ago
Not sure if I got the question right, but there are benchmarks like SWE-Bench Pro and the like. There's a whole other debate about whether you can trust them, and whether the labs are training on those benchmarks, but that's one way to measure it.

Other than benchmarks, I'd say the best measure is your own test suite.

1 comment

I would never trust benchmarks tbh, most of the new model releases are benchmaxxing.