by marlburrow
59 days ago
The "private benchmarks" suggestion comes up every time, but I think
there's a more interesting axis: benchmarks built on top of
already-public, already-stable test instruments. SWE-bench is
fundamentally a corpus that lives on GitHub — once it ships, it leaks
into training data automatically. Benchmarks built on contested
qualitative instruments (psych tests, opinion surveys) have a different
contamination profile because the correct answer doesn't exist in the
training corpus to memorize; only the question does.

That doesn't help for measuring coding ability specifically (you
fundamentally need a code-correctness oracle), but for capability axes
where the "answer" is a stated position rather than a verifiable fact,
public + stable can still be useful. The SWE-bench problem isn't really
"public"; it's "public + has a fixed correct answer".