HN Mirror



	by Bjorkbat 264 days ago
	Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench results are barely an improvement, yet somehow the model shows a more significant improvement on long tasks. You'd expect a huge gain in one to translate to the other, unless the main area of improvement was tools and scaffolding rather than the model itself.