by bravura
588 days ago
Regarding keeping the test set private to avoid contamination: the comments about leakage are spot on. The real test set should always be the future. We should evaluate LLMs on text written after their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning. The intuition is that being able to predict the future of science (or any topic, really) indicates true understanding. To make that concrete: when ICLR 2025 announces and publishes its accepted papers, Yoshua Bengio will be less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao will be less surprised/perplexed by what gets proven in math over the next 10 years than a graduate student in a related field. This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861
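To make "per-byte" concrete, here's a rough sketch of how I'd compute it, assuming you already have the natural-log probability the model assigned to each token of a post-cutoff document. The function names are mine for illustration, not from the linked paper or any library. Normalizing by UTF-8 bytes rather than tokens is what makes models with different tokenizers comparable.

```python
import math
import zlib


def bits_per_byte(token_logprobs, text):
    """Bits the model needs per UTF-8 byte of `text`.

    token_logprobs: natural-log probabilities the model assigned to each
    token of `text` (assumed input; how you obtain them depends on your model).
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    # Negative log-likelihood, converted from nats to bits.
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / n_bytes


def per_byte_perplexity(token_logprobs, text):
    """Perplexity normalized per byte: 2 ** (bits per byte)."""
    return 2.0 ** bits_per_byte(token_logprobs, text)


def compression_ratio(token_logprobs, text):
    """Compressed size / raw size for an idealized arithmetic coder
    driven by the model's predictions (8 raw bits per byte)."""
    return bits_per_byte(token_logprobs, text) / 8.0


def zlib_ratio(text):
    """Same ratio for a generic compressor, as a sanity baseline."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Lower is better on all three model metrics. Run it on documents dated after each model's cutoff, and compare against `zlib_ratio` on the same text to see how much of the gain comes from the model rather than plain redundancy.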