| HN Mirror



	by stingraycharles 9 hours ago
	Ehr, the SWE-bench examples are particularly bad, as they are just publicly available historical PRs. If the models are trained on GitHub data, those PRs will be in the training set. So almost by design that benchmark is tainted, and it measures recall rather than reasoning.

1 comment

Wow, that's worse than I thought, and it breaks the number one rule of machine learning: you don't train the model on your test dataset.