Hacker News
by driverdan 954 days ago
This is a big problem with independent LLM testing. You need to make sure your test set isn't included in the training set, which isn't easy with closed-source models because you can't inspect what they were trained on.

This makes me think of how hardware manufacturers optimize for benchmarks. Vendors of closed-source LLMs can intentionally include likely test data in their training sets to artificially inflate results. I'm not saying they are doing that now, but they could.
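
For what it's worth, here is a rough sketch of the kind of overlap check I mean, assuming you have some corpus you can actually inspect (say, a public web crawl you suspect ended up in training). With a closed-source model you don't have the real training set, so this only gives a proxy. The function names and the 8-token window are my own choices for illustration, not any standard tool.

```python
def ngrams(text, n=8):
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, corpus_docs, n=8):
    """Flag test items that share at least one n-gram with the corpus.

    Hypothetical helper: returns (fraction_flagged, flagged_items).
    corpus_docs is whatever text you can inspect, not the model's
    actual training data.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(test_items) if test_items else 0.0
    return rate, flagged
```

A zero rate here doesn't prove the test set is clean, since paraphrased or reformatted copies won't share exact n-grams. It just catches the obvious verbatim leaks.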