by stingraycharles 42 days ago
The problem with these types of benchmarks is that it’s 100% certain the LLM has been trained on all that code already, so they’re all tainted: you don’t know whether they’re benchmarking recall or actual reasoning.

Same with SWE-bench and others.


I agree it's a potentially big problem, affecting almost any benchmark out there. We discuss it briefly in "Appendix A: Contamination and memorization" https://epoch.ai/blog/mirrorcode-preliminary-results#appendi....

Ideally one would do these benchmarks with held-out proprietary software, but that comes with many practical concerns.

That's a feature, not a bug. It doesn't make benchmarking any more meaningful or simple, but being trained to recall patterns is a legitimate goal for a coding agent.

Yes, but then the benchmarks need to be presented as "this verifies whether the model can recall this exact same situation and does not actually benchmark any reasoning at all".

This is not the case; they're being presented as "how good is the model at software engineering". E.g. the benchmark in question says this:

"Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically."

But when your benchmark is embedded this thoroughly in the training data, you're actually just benchmarking "how well do you remember what sqlite looks like" rather than "do you understand all the tradeoffs, risks, and design decisions that need to be made to build a bespoke database from scratch".

This is a VERY big caveat that, to me, goes a long way toward explaining the discrepancy between benchmarks and reality.