| HN Mirror


by mlepath 575 days ago

Yeah, people have a really hard time dealing with data leakage, especially on data sets as large as the ones LLMs need.

Basically, if something appeared online or was transmitted over the wire, it should no longer be eligible to evaluate on. D. Sculley had a great talk at NeurIPS 2024 (same conference this paper was in) titled "Empirical Rigor at Scale – or, How Not to Fool Yourself".
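To make that eligibility rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EvalItem` record, its `first_public` field, and the idea of comparing against a known training cutoff are hypothetical, not something from the talk or the paper. It also relaxes the rule slightly by allowing items that only became public after the cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class EvalItem:
    """Hypothetical benchmark record; the field names are illustrative."""
    item_id: str
    prompt: str
    # Earliest date the item was posted or sent anywhere; None if never.
    first_public: Optional[date]


def eligible_items(items: List[EvalItem], training_cutoff: date) -> List[EvalItem]:
    """Keep items that were never public, or went public only after the cutoff."""
    return [
        item
        for item in items
        if item.first_public is None or item.first_public > training_cutoff
    ]


items = [
    EvalItem("a", "Question posted on a forum", date(2023, 5, 1)),
    EvalItem("b", "Question written privately", None),
    EvalItem("c", "Question published after the cutoff", date(2025, 1, 15)),
]
kept = eligible_items(items, training_cutoff=date(2024, 6, 1))
print([item.item_id for item in kept])  # ['b', 'c']
```

The hard part in practice is knowing `first_public` at all, which is where the leakage usually sneaks in.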

Basically no one knows how to properly evaluate LLMs.


refulgentis 575 days ago

No, an absolutely massive number of people do. In fact, they have been doing exactly what you recommend because, as you note, it's obvious and required for a basic, proper evaluation.
