by stingraycharles
42 days ago
The problem with these types of benchmarks is that it’s 100% certain the LLM has already been trained on all that code, so the results are all tainted: you don’t know whether you’re benchmarking recall or actual reasoning. Same with SWE-bench and others.

Ideally one would run these benchmarks on held-out proprietary software, but that comes with many practical concerns.