| HN Mirror


	by marlburrow 59 days ago
	The "private benchmarks" suggestion comes up every time, but I think there's a more interesting axis: benchmarks built on top of already-public, already-stable test instruments. SWE-bench is fundamentally a corpus that lives on GitHub, so once it ships, it leaks into training data automatically. Benchmarks built on contested qualitative instruments (psych tests, opinion surveys) have a different contamination profile, because there is no correct answer in the training corpus to memorize, only the question. That doesn't help for measuring coding ability specifically (there you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public"; it's "public + has a fixed correct answer".
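
	To make the "public + has a fixed correct answer" distinction concrete, here is a toy sketch. Everything in it is hypothetical (the items, the `memorizer` stand-in, the scoring helpers); it only illustrates the argument and is not how SWE-bench or any real survey-based benchmark is scored.

```python
# Toy illustration only: all names and data here are hypothetical.

# Public benchmark items that ship with a fixed answer. Once public,
# both the question and the answer can end up in a training corpus.
leaked_corpus = {
    "fix off-by-one in paginate()": "use range(start, end + 1)",
    "handle empty input in parse()": "return [] when text is empty",
}

# A public survey-style instrument: the questions are just as public,
# but there is no answer key that could leak alongside them.
survey_items = [
    "I prefer caution over speed. (agree/disagree)",
    "Rules should sometimes be broken. (agree/disagree)",
]


def memorizer(prompt):
    """Stand-in for a contaminated model: a memorized answer, or None."""
    return leaked_corpus.get(prompt)


def fixed_answer_score(model, benchmark):
    """Fraction of items where the model's output equals the fixed answer."""
    hits = sum(model(q) == a for q, a in benchmark.items())
    return hits / len(benchmark)


def memorized_fraction(model, questions):
    """Fraction of questions for which the model has anything memorized."""
    return sum(model(q) is not None for q in questions) / len(questions)


# Pure memorization maxes out the fixed-answer benchmark...
coding_score = fixed_answer_score(memorizer, leaked_corpus)
# ...but gives it nothing to recite for the survey items.
survey_recall = memorized_fraction(memorizer, survey_items)
```

	The lookup table gets a perfect score on the fixed-answer items without any ability at all, while for the survey items there is simply nothing to look up.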