	by criemen 324 days ago
	Running evals isn't the problem; the problem is acquiring or building a high-quality, non-contaminated dataset. https://arxiv.org/abs/2506.12286 makes a very compelling case that SWE-bench (and by extension, anything that's based on public source code) is most likely overestimating your agent's actual capabilities.