by tadamcz 44 days ago
I agree it's a potentially big problem, affecting almost any benchmark out there. We discuss it briefly in "Appendix A: Contamination and memorization" https://epoch.ai/blog/mirrorcode-preliminary-results#appendi....

Ideally one would run these benchmarks on held-out proprietary software, but that comes with many practical concerns.