Hacker News
by stared 60 days ago
SWE-bench Verified is, at this point, contaminated
https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
So it is hard to tell how much of a model's gain is due to skill and how much to overfitting.