Hacker News
by yaodub 2 days ago
SWE-bench measures single tasks in isolation. In a real loop, the model usually loses track of what I was trying to do long before code quality becomes the issue.