Is it possible that some LLMs are trained on these benchmarks? That would mean they’re overfitting and are incorrectly ranked. Or am I misunderstanding these benchmarks?…
Having worked on ML products, there is sometimes debate over whether you should train on the test partition prior to prod deployment - after all, why would you ship a worse model to prod? Obviously you then can't tell whether the model generalizes better than an alternative technique, and you also incur some overfit risk. But many industrial problems are solvable through memorization.
> after all, why would you ship a worse model to prod?
...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folk love removing the "science" from "data science"
You can evaluate a version of the model that has been trained on one set of data, and ship to production a different model that has been trained on the complete set of data. In many cases one can reasonably infer that the model which has seen all of the data will be better than the model which has seen only some of the data.
I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.
This is possible if you, e.g., train 1000 models on different subsets of the data and verify that each and every one of them performs well. In that case, you can reasonably infer that another model trained on all of the data would work well, too.
But this is, of course, 1000 times more expensive to do. And if you only train 100, or 10, or 1 model, then the deduction becomes increasingly unstable.
So from a practical point of view, it's probably not feasible, because you would put those resources into something else instead that has more ROI.
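A toy-scale sketch of the resampling idea described above, where training is cheap enough to repeat. Everything here is an assumption for illustration: the synthetic data, the one-variable least-squares model, and the helper names `fit_line`, `mse` and `subset_scores`. It only shows how the spread of held-out scores can be inspected as the number of runs grows; it says nothing about the cost for a real model.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted (a, b) line on (x, y) pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

def subset_scores(data, n_runs, train_fraction, seed=0):
    """Train n_runs models on random subsets; score each on its held-out rest."""
    rng = random.Random(seed)
    cut = int(len(data) * train_fraction)
    scores = []
    for _ in range(n_runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        model = fit_line(shuffled[:cut])
        scores.append(mse(model, shuffled[cut:]))
    return scores

# Synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise.
data_rng = random.Random(42)
data = []
for _ in range(200):
    x = data_rng.uniform(0, 10)
    data.append((x, 2 * x + 1 + data_rng.gauss(0, 0.5)))

# More runs give a better picture of how much the held-out score varies.
for n_runs in (1, 10, 100):
    scores = subset_scores(data, n_runs, train_fraction=0.8)
    print(n_runs, round(statistics.mean(scores), 3), round(statistics.pstdev(scores), 3))
```

With a single run the spread is trivially zero, which is the point being made: one split gives no information about how stable the estimate is.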
I have personally never seen a situation where more training data (of similar quality) causes the model to perform worse. Have you seen such a situation? Please provide an example.
Your suggestion of running 1000 training runs with different subsets of data sounds excessive and unnecessary to me.
> infer that the model which has seen all of the data will be better than the model which has seen only some of the data.
It really depends upon the data. A smaller set of data that mostly consists of "truth" might be better than a larger dataset that also has many "lies".
Perhaps what you mean is that the model might be more representative, rather than _better_.
There are offline metrics and online metrics. Offline metrics might be something like AUROC on a test set. Once you’ve pushed the model online, you can check the online metrics. Ultimately the online metrics are more important, that’s the whole reason the model exists in the first place.
Your control in an online environment is the current baseline. You don’t need to save the test set anymore, you can push it online and test it directly.
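For the offline side, here is a minimal sketch of what an AUROC computation on a test set looks like, written from the pairwise definition (the fraction of positive/negative pairs where the positive is scored higher, with ties counted as half). The function name `auroc` and the sample labels and scores are made up for illustration; in practice a library routine would normally be used instead.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. This is the O(P*N) pairwise form, fine for small sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical test-set labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75: three of the four positive/negative pairs are ordered correctly
```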
This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)
I have personally shipped "untested" models in production in situations where a "secret test set" does not exist. (Train on subset of data -> evaluate on different subset of data -> train again on entire dataset).
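The three-step workflow in parentheses above can be sketched as follows. This is a toy illustration under stated assumptions: the data is synthetic, the model is a one-variable least-squares line, and `fit_line` and `mse` are hypothetical helpers, not anyone's production code.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a*x + b on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted (a, b) line on (x, y) pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

# Synthetic data (hypothetical): y = 3x - 2 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3 * x - 2 + rng.gauss(0, 1.0)))
rng.shuffle(data)

train, holdout = data[:80], data[80:]

# Step 1: train on a subset. Step 2: evaluate on data that model has not seen.
eval_model = fit_line(train)
holdout_mse = mse(eval_model, holdout)

# Step 3: refit on everything and ship that model. Its own held-out score is
# unknown; holdout_mse is only an estimate carried over from eval_model.
shipped_model = fit_line(data)

print("held-out MSE of the evaluated model:", round(holdout_mse, 3))
print("shipped coefficients:", shipped_model)
```

The shipped model is "untested" in exactly the sense described: the reported number belongs to a sibling model trained with the same recipe on less data.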
Given all of the times OpenAI has trained on people's examples of "bad" prompts, I am sure they are fine-tuning on these benchmarks. It's the natural thing to do if you are trying to position yourself as the "most accurate" AI.
Assuming they were doing that, fine-tuning on benchmarks isn't the same as test leakage/testing on training data. No researcher is intentionally training on test data.
If it performs about as well on instances it has never seen before (the test set), then it's not overfit to the test.
I'm confused, fine-tuning is training. How is that not leakage? I'm hesitant to call them researchers, they are employees of a for-profit company trying to meet investor expectations.
1. You train on the kind of problems you want to solve. You don't report numbers that evaluate performance on examples the model trained on. Datasets will typically have splits, one for training and another for testing.
2. OpenAI is capped-profit. They are also not a publicly traded company. Researchers are researchers regardless of who they work for. Training on test data is especially stupid for commercial applications because customers find that out quickly and any reputation is gone.
I am suggesting that OpenAI's main product is "LLM that benchmarks the best." From that point, it is completely illogical not to train on at least some of the test data (or data that is very similar to the test data) so that you can fudge the numbers in your favor. You don't want to go too far, but overfitting a tiny bit will make you look like you have a significant edge. When someone says that your product isn't that good, you then point to the benchmarks and say, "objective measures say that you are wrong." This is a tried and true marketing technique.
Hardware companies, which live and die on benchmarks, do this all the time. Meanwhile, it does appear that OpenAI is underperforming consumer expectations, and losing users quite quickly at this point, despite doing incredibly well on benchmarks.
Also, this isn't about profit. It's about market cap and it's about prestige. Those are not correlated to profit.
It would be a bit of a scandal, and IMO too much hassle to sneak in. These models are trained on massive amounts of text - specifically anticipating which metrics people will care about and generating synthetic data just for them seems extra.
I don't think it's a scandal, it's a natural thing that happens when iterating on models. OP doesn't mean they literally train on those tests, but that as a meta-consequence of using those tests as benchmarks, you will adjust the model and hyperparameters in ways that perform better on those tests.
For a particular model you try to minimally do this by separating a test and validation set, but on a meta-meta level, it's easy to see it happening.
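A minimal sketch of the validation/test separation mentioned above: the hyperparameter is chosen on a validation split, and the test split is scored once after that choice. The k-nearest-neighbour regressor, the synthetic sine data, the candidate values of `k`, and the helper names are all assumptions for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def knn_mse(train, points, k):
    """Mean squared error of k-NN predictions on (x, y) pairs."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

# Synthetic data (hypothetical): y = sin(x) plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0, 6)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

train, validation, test = data[:200], data[200:250], data[250:]

# Hyperparameter search only ever looks at the validation split.
candidates = [1, 3, 5, 9, 15, 25]
val_scores = {k: knn_mse(train, validation, k) for k in candidates}
best_k = min(val_scores, key=val_scores.get)

# The test split is touched once, after the choice is made.
test_score = knn_mse(train, test, best_k)

print("chosen k:", best_k)
print("validation MSE:", round(val_scores[best_k], 3))
print("test MSE:", round(test_score, 3))
```

The meta-level effect described above is what happens when the last step stops being a one-off: if `test_score` is looked at after every change and the model is adjusted in response, the test split gradually plays the role of a second validation set.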
You don't see an engineer at an extremely PR-conscious company at least checking how their model performs on popular benchmarks before rolling it out? And if its performance is lackluster, do you really see them doing nothing about it? It probably doesn't make a huge difference anyway. I know those old vision models were overfitted to the standard image library benchmarks, but they were still very impressive.
This wasn't so much overtraining, as the models learning something different than what we expected. If you look at a pixel by pixel representation of an image, textures tend to be more significant/unique patterns than shapes. There are some funny studies from the mid 2010s exploring this.