| HN Mirror

	by yaodub 2 days ago
	SWE-bench measures single tasks in isolation. In a real loop, the model usually loses track of what I was trying to do long before code quality becomes the issue.