

	by itamarcode 317 days ago
	Unlike most SWE-bench submissions, the Qodo Command one uses the product directly. I think the next step is getting an official "checked" mark from the SWE-bench team.

1 comments

whymauri 317 days ago

I feel like the bash-only SWE-bench Verified setup (a.k.a. model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model rather than the scaffolding.

https://github.com/SWE-agent/mini-swe-agent


NitpickLawyer 317 days ago

There's swe-rebench, where they collect "bugs/issues" by date, and you can drag a slider on their leaderboard to see only issues solved from after the model was released (obviously this only truly works for open models).
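
A rough sketch of the date-slider idea, in case it helps: score a model only on issues created after its release date, so the issues cannot have been in its training data. This is not swe-rebench's actual code; the `TaskResult` fields, the `resolve_rate_after` helper, and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TaskResult:
    """One benchmark task outcome (hypothetical record layout)."""
    task_id: str
    issue_created: date  # when the upstream issue was opened
    resolved: bool       # whether the model's patch passed the tests


def resolve_rate_after(results: list[TaskResult], cutoff: date) -> Optional[float]:
    """Fraction of resolved tasks among issues created strictly after `cutoff`.

    Returns None when no task is newer than the cutoff.
    """
    fresh = [r for r in results if r.issue_created > cutoff]
    if not fresh:
        return None
    return sum(r.resolved for r in fresh) / len(fresh)


# Invented sample data, for illustration only.
results = [
    TaskResult("a", date(2024, 1, 10), True),
    TaskResult("b", date(2024, 6, 1), False),
    TaskResult("c", date(2024, 7, 15), True),
    TaskResult("d", date(2024, 8, 2), False),
]

# Pretend the model was released on 2024-05-01: only b, c, d count.
model_release = date(2024, 5, 1)
print(resolve_rate_after(results, model_release))
```

Moving `model_release` later plays the role of dragging the slider: fewer tasks qualify, but the ones that remain are less likely to be contaminated.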
