Hacker News
by
mordae
26 days ago
This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.
lordmauve
26 days ago
Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.
https://github.com/datacurve-ai/deep-swe