by tadamcz
44 days ago
I agree it's a potentially big problem, affecting almost any benchmark out there. We discuss it briefly in "Appendix A: Contamination and memorization" https://epoch.ai/blog/mirrorcode-preliminary-results#appendi.... Ideally one would do these benchmarks with held-out proprietary software, but that comes with many practical concerns.