Hacker News
by ofirpress 396 days ago
Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can get from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html

1 comment

I mean it's possible that SWE-bench is being specifically targeted in training, so the results may not reflect real-world performance.