by FanaHOVA 1063 days ago
Presented with no comment :) https://twitter.com/chhillee/status/1635790330854526981?s=46...
2 comments

Having worked on ML products, I've seen that there is sometimes debate on whether you should train on the test partition prior to prod deployment - after all, why would you ship a worse model to prod? Obviously you then can't tell whether the model generalizes better than an alternative technique, and you also incur some overfitting risk. But many industrial problems are solvable through memorization.
> after all, why would you ship a worse model to prod?

...because you need a control to evaluate how well your product is doing? I know it's a young field, but boy, do some folk love removing the "science" from "data science"

You can evaluate a version of the model that has been trained on one set of data, and ship to production a different model that has been trained on the complete set of data. In many cases one can reasonably infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

I'm not claiming that's what happened here, nor am I interested in nitpicking "what counts as 'science'". I'm just saying this is a reasonable thing to do.
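A minimal sketch of the evaluate-then-retrain workflow described above, using only the Python standard library. The synthetic data, the 80/20 split, and the helper names `fit_line` and `mse` are assumptions made for illustration; they are not anything from the thread.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (assumption): true line y = 3x + 1 plus unit Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Step 1: fit on 80% of the rows and score on the held-out 20%.
split = int(0.8 * len(xs))
eval_model = fit_line(xs[:split], ys[:split])
holdout_mse = mse(eval_model, xs[split:], ys[split:])

# Step 2: refit on every row; this is the model that would ship.
# holdout_mse describes eval_model, not shipped_model: there is no
# unseen data left on which shipped_model itself could be scored.
shipped_model = fit_line(xs, ys)

print(f"held-out MSE of the evaluated model: {holdout_mse:.3f}")
print(f"shipped model: slope={shipped_model[0]:.3f}, intercept={shipped_model[1]:.3f}")
```

The held-out score is only a proxy for the shipped model, which is exactly the point the replies argue about.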

This is possible if you, e.g., train 1000 models on different subsets of the data and verify that each and every one of them performs well. In that case, you can reasonably infer that another model trained on all of the data would work well, too.

But this is, of course, 1000 times more expensive to do. And if you only train 100, or 10, or 1 model, then the deduction becomes increasingly unstable.

So from a practical point of view, it's probably not feasible, because you would rather put those resources into something else with a higher ROI.
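The check described above (train several models on different subsets and look at how consistent they are) could be sketched roughly like this. The nearest-centroid classifier, the two synthetic Gaussian classes, and the function names are all assumptions chosen to keep the example small.

```python
import random
import statistics

def fit_centroids(points):
    """Nearest-centroid classifier on 1-D data: store the mean of each class."""
    by_label = {}
    for x, label in points:
        by_label.setdefault(label, []).append(x)
    return {label: statistics.fmean(vals) for label, vals in by_label.items()}

def accuracy(centroids, points):
    """Fraction of points whose nearest centroid carries the right label."""
    hits = 0
    for x, label in points:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        hits += pred == label
    return hits / len(points)

def subset_scores(points, n_models, holdout_frac=0.2, seed=0):
    """Train n_models models, each on a different random subset,
    and score each one on the rows it did not see."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        shuffled = points[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * holdout_frac)
        held_out, train = shuffled[:cut], shuffled[cut:]
        scores.append(accuracy(fit_centroids(train), held_out))
    return scores

# Synthetic data (assumption): two classes drawn from N(0, 1) and N(3, 1).
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), 0) for _ in range(150)]
points += [(data_rng.gauss(3, 1), 1) for _ in range(150)]

for n_models in (1, 10, 100):
    scores = subset_scores(points, n_models)
    spread = statistics.stdev(scores) if len(scores) > 1 else float("nan")
    print(f"{n_models:>3} models: mean={statistics.fmean(scores):.3f} "
          f"min={min(scores):.3f} stdev={spread:.3f}")
```

Training cost grows linearly with `n_models`, and with a single model there is no spread to look at, which is the trade-off raised above.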

I have personally never seen a situation where more training data (of similar quality) causes the model to perform worse. Have you seen such a situation? Please provide an example.

Your suggestion of running 1000 training runs with different subsets of data sounds excessive and unnecessary to me.

You have to know when to stop training. How are you going to do that without a test set? How do you know when you have achieved generalization without over-fitting?
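One common way to decide when to stop is early stopping on a held-out set. The sketch below assumes a degree-9 polynomial fitted by full-batch gradient descent on a few noisy samples of sin(3x); the learning rate, the patience value, and all function names are illustrative assumptions.

```python
import math
import random

DEGREE = 9  # assumption: deliberately flexible model for only 15 training points

def features(x):
    return [x ** k for k in range(DEGREE + 1)]

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def train_with_early_stopping(train, val, lr=0.05, max_epochs=2000, patience=50):
    """Gradient descent that keeps the weights with the best held-out loss
    and stops once that loss has not improved for `patience` epochs."""
    w = [0.0] * (DEGREE + 1)
    best_w, best_val, best_epoch = w[:], mse(w, val), 0
    for epoch in range(1, max_epochs + 1):
        # One full-batch gradient step on the training loss.
        grad = [0.0] * len(w)
        for x, y in train:
            err = predict(w, x) - y
            for k, f in enumerate(features(x)):
                grad[k] += 2 * err * f / len(train)
        w = [wi - lr * g for wi, g in zip(w, grad)]
        # The held-out set decides which weights to keep and when to stop.
        val_loss = mse(w, val)
        if val_loss < best_val:
            best_w, best_val, best_epoch = w[:], val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_w, best_val, best_epoch

rng = random.Random(0)

def sample(n):
    """Noisy samples of sin(3x) on [-1, 1] (synthetic data, an assumption)."""
    return [(x, math.sin(3 * x) + rng.gauss(0, 0.2))
            for x in (rng.uniform(-1, 1) for _ in range(n))]

train, val = sample(15), sample(15)
initial_val = mse([0.0] * (DEGREE + 1), val)
best_w, best_val, best_epoch = train_with_early_stopping(train, val)
print(f"held-out loss: {initial_val:.3f} at start, {best_val:.3f} at epoch {best_epoch}")
```

If the held-out rows were folded into `train`, the stopping rule would have nothing independent left to watch.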
In the case of fine-tuning, you can end up with catastrophic forgetting. Architecture can influence how performance scales with data, and adding data doesn’t always improve performance.
> infer that the model which has seen all of the data will be better than the model which has seen only some of the data.

It really depends upon the data. A smaller set of data that mostly consists of "truth" might be better than a larger dataset that also has many "lies".

Perhaps what you mean is that the model might be more representative, rather than _better_.

There are offline metrics and online metrics. Offline metrics might be something like AUROC on a test set. Once you’ve pushed the model online, you can check the online metrics. Ultimately the online metrics are more important; they are the whole reason the model exists in the first place.
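For reference, AUROC as an offline metric can be computed directly from its pairwise definition. This is a small, quadratic-time sketch for illustration; the function name and the toy labels and scores are assumptions.

```python
def auroc(labels, scores):
    """Probability that a randomly chosen positive is scored above a
    randomly chosen negative, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```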

Your control in an online environment is the current baseline. You don’t need to hold out the test set anymore; you can push the new model online and test it directly against that baseline.
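Comparing a candidate against the current baseline online could look something like the following two-proportion z-test. It assumes the online metric is a conversion rate and that traffic is split randomly between the two models; the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between
    arm A (baseline) and arm B (candidate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: baseline model (A) vs. candidate model (B).
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
print(f"z={z:.2f}, p={p:.3f}")
```

Here the baseline arm plays the role of the control, so no offline test set is involved in the comparison.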

Why would you want to ship an untested model? That's insane.
This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)
Yeah but in competitions there's a secret test set used to evaluate the model.
I have personally shipped "untested" models in production in situations where a "secret test set" does not exist. (Train on subset of data -> evaluate on different subset of data -> train again on entire dataset).

I do not consider myself to be insane.

I didn't mean to insult anyone. The idea of not knowing the actual performance of the model just intuitively seems to me like it's a bit of a gamble. I have only trained models in a scientific context before, where this was never an option.