by light_hue_1
584 days ago
If I were going to bet, I would bet yes, they will reach above 85% performance. The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than at ones created after. There are countless papers that show significant leakage between training and test sets. This is in part why so many LLMs look so strong according to benchmarks, particularly older popular ones, but then prove to be so weak in practice when you try them out. In addition to leakage, people also over-tune their LLMs to specific datasets, and they go out and collect more data that looks like the dataset they want to perform well on. There's a lot of behind-the-scenes talk about unethical teams that collect data which doesn't technically overlap with test sets but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.
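To make the "doesn't technically overlap, but is extremely close" case concrete, here is a minimal sketch of a near-duplicate check based on word n-gram overlap. This is not the error-pattern analysis mentioned above, just a cruder text-side check, and it assumes you actually have access to the training text, which outsiders usually don't. The function names, the trigram size, and the 0.5 threshold are all illustrative choices, not anything standard.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity between the word n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        # Too short to form any n-gram: treat as no evidence of overlap.
        return 0.0
    return len(ga & gb) / len(ga | gb)


def flag_near_duplicates(test_items, train_items, threshold=0.5, n=3):
    """Return (test_index, train_index, score) for pairs at or above threshold.

    Brute-force O(len(test) * len(train)); fine for a sketch, far too slow
    for a real training corpus.
    """
    hits = []
    for i, t in enumerate(test_items):
        for j, s in enumerate(train_items):
            score = jaccard(t, s, n)
            if score >= threshold:
                hits.append((i, j, score))
    return hits


# Hypothetical data: the first "training" sentence is a light rewording
# of the test question, the second is unrelated.
test_items = ["What is the capital of France and its population?"]
train_items = [
    "What is the capital of France and its population today?",
    "How do I bake sourdough bread at home?",
]
hits = flag_near_duplicates(test_items, train_items)
```

An exact-match filter would let the reworded sentence through, while the overlap score flags it. A paraphrase that changes most of the wording would still slip past a check like this, which is part of why the problem is hard.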