Hacker News
by light_hue_1 584 days ago
If I was going to bet, I would bet yes, they will reach above 85% performance.

The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than after. There are countless papers that show significant leakage between training and test sets for models.

This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.

In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.

There's a lot of behind the scenes talk about unethical teams that collect data which doesn't technically overlap test sets, but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.

2 comments

Could you run the benchmark by bootstrapping (average of repeated subsampling), instead of a straight-across performance score, and regain some leakage resistance that way? As well as a better simulation of "out of sample" data, at least for a little while.
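For concreteness, here is a rough sketch of what a bootstrapped score could look like. Everything in it is an assumption for illustration: the function name `bootstrap_accuracy`, the 0/1 per-question results list, and the choice of a 95% percentile interval are not from Epoch or any benchmark harness. Note that resampling only reuses the same questions, so what it estimates is how much the score varies with question selection; it does not remove any leaked items.

```python
import random

def bootstrap_accuracy(correct, n_resamples=1000, subsample_size=None, seed=0):
    """Bootstrap a benchmark score from per-question results.

    correct: list of 0/1 flags, one per benchmark question (hypothetical input).
    Returns (mean, low, high): the average resampled accuracy and a
    95% percentile interval over the resamples.
    """
    rng = random.Random(seed)
    n = len(correct)
    k = subsample_size or n  # default: resample as many questions as exist
    scores = []
    for _ in range(n_resamples):
        # Draw k questions with replacement and score that subsample.
        sample = [correct[rng.randrange(n)] for _ in range(k)]
        scores.append(sum(sample) / k)
    scores.sort()
    mean = sum(scores) / len(scores)
    low = scores[int(0.025 * n_resamples)]
    high = scores[int(0.975 * n_resamples) - 1]
    return mean, low, high

# Hypothetical run: 30 of 100 questions answered correctly.
results = [1] * 30 + [0] * 70
mean, low, high = bootstrap_accuracy(results)
print(f"accuracy ~ {mean:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Reporting the interval rather than a single number at least makes it harder to read much into small differences between models.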
This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.
> answers will be kept fully private

> Short of the companies fishing out the questions from API logs (which seems quite unlikely)

They all pretty clearly state[1] versions of "We use your queries (removing personal data) to improve the models" so I'm not sure why that's unlikely.

[1] https://help.openai.com/en/articles/5722486-how-your-data-is...

Ideally they would have batches of those exercises, where they only use the next batch when someone has solved a suspicious number of those exercises. If it performs much worse on the next batch, that is a tell of leakage.
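A minimal sketch of that batch check, assuming the two batches are of comparable difficulty (a big assumption) and using a plain two-proportion z-test. The function names and the 2.33 threshold (roughly a one-sided 1% level) are my own choices for illustration, not anything the benchmark authors have described.

```python
import math

def batch_drop_z(correct_old, n_old, correct_new, n_new):
    """z-score for the accuracy drop from the old batch to the new one.

    Uses the pooled two-proportion z-test. Large positive values mean the
    model did much better on the old (possibly leaked) batch.
    """
    p_old = correct_old / n_old
    p_new = correct_new / n_new
    pooled = (correct_old + correct_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return 0.0  # all right or all wrong on both batches: no signal
    return (p_old - p_new) / se

def looks_like_leakage(correct_old, n_old, correct_new, n_new, z_threshold=2.33):
    """Flag a drop that is unlikely under equal true accuracy (assumed threshold)."""
    return batch_drop_z(correct_old, n_old, correct_new, n_new) > z_threshold

# Hypothetical numbers: 80/100 on the old batch, 40/100 on the fresh one.
print(looks_like_leakage(80, 100, 40, 100))  # large drop, flagged
print(looks_like_leakage(50, 100, 48, 100))  # small drop, not flagged
```

A flag here is only a tell, not proof: a harder second batch would produce the same pattern.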
I looked at the sample questions and even if they get the questions there is no way they will figure out the answers without making significant breakthroughs in understanding mathematics and logic.