Hacker News
by snowstormsun 1062 days ago
Why would you want to ship an untested model? That's insane.

This is a common approach, for example, in data science competitions. Why? Well, if you want to maximize the model's abilities, this is what you have to do. (Not saying Llama 2 is released like this; it probably isn't)
Yeah but in competitions there's a secret test set used to evaluate the model.
I have personally shipped "untested" models in production in situations where a "secret test set" does not exist. (Train on subset of data -> evaluate on different subset of data -> train again on entire dataset).

I do not consider myself to be insane.
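
For anyone unfamiliar with the workflow described above (train on a subset -> evaluate on a different subset -> train again on everything), here is a minimal sketch. The toy data, the 150/50 split, and the helper names `fit_line` and `mse` are all assumptions made up for illustration, not anything from a real production setup.

```python
import random

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: y = 3x + 1 plus unit-variance Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

split = 150  # assumed split point, chosen arbitrarily

# Step 1: train a candidate on a subset of the data.
candidate = fit_line(xs[:split], ys[:split])

# Step 2: evaluate the candidate on the rows it has not seen.
holdout_mse = mse(candidate, xs[split:], ys[split:])

# Step 3: retrain with the same procedure on the entire dataset and ship that.
# No unseen rows remain to score `shipped`; `holdout_mse` is only a proxy for it.
shipped = fit_line(xs, ys)
```

The number you report is `holdout_mse`, which was measured on `candidate`, not on `shipped`. That gap is exactly what the "untested" label refers to.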

I didn't mean to insult anyone. The idea of not knowing the actual performance of the model just intuitively seems to me like it's a bit of a gamble. I have only trained models in a scientific context before, where this was never an option.
Here's another way to look at it. The test set is an approximation for how the model will perform against production data, but the actual performance of the model is how it performs for actual end-users. So real _actual_ results are always unknown until after the fact. Given that, if the metrics from training clearly show that more data == better model, and there's no reason to expect that trend to reverse, then the logical thing to do is maximise the data used for training to get the best results for actual production data.

Doing this does complicate decisions for releasing subsequent model updates, as the production model can't be directly compared against new iterations any more. Instead, a pre-production model that has not seen the test set would need to be used. However, if data drift is likely, then re-using the old test set wouldn't be useful anyway.
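
One way to check the "more data == better model" trend mentioned above is a learning curve: fit on increasingly large samples and score each fit on the same held-out set. This is only a sketch on synthetic data; the sample sizes, the 30 repetitions, and the `sample` helper are assumptions chosen for illustration.

```python
import random

def fit_line(points):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted (a, b) line on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

rng = random.Random(1)

def sample(n):
    """Hypothetical data source: y = 2x - 1 plus unit-variance noise."""
    return [(x, 2.0 * x - 1.0 + rng.gauss(0, 1))
            for x in (rng.uniform(0, 5) for _ in range(n))]

holdout = sample(500)  # fixed evaluation set, never used for fitting

learning_curve = {}
for n in (5, 20, 80, 320):
    # Average over repeated draws so one lucky small sample doesn't mislead.
    scores = [mse(fit_line(sample(n)), holdout) for _ in range(30)]
    learning_curve[n] = sum(scores) / len(scores)

for n, score in learning_curve.items():
    print(f"train size {n:>3}: mean holdout MSE {score:.3f}")
```

If the curve is still falling at the largest size you can afford to hold data out from, that is the evidence for folding the test rows back into training.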

Another way of thinking about it: if training on all the data yields a model that is functionally 5% better in online metrics, which would not be uncommon in a Pareto-distributed traffic pattern, then any subsequent partitioned model would likely perform worse than the prod model.

More complications arise when users expect things that previously worked one way to continue working that way. Users don't really care that their traffic was in the test set. In an even more extreme case, many industrial problems have a high correlation between the traffic today and the traffic next week. An optimal solution for such a situation would be to fully memorize today's traffic and use that for next week. In many cases, an overfit model can effectively perform this memorization task with fewer parameters/infrastructure than an actual dictionary lookup.

You act like training is this pre-set process you just "do". That's not the case: you train until you reach the desired performance on the test set. If you don't have a test set, how do you know when to stop training and avoid overfitting?
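
To make the stopping question concrete, here is a rough sketch of early stopping against a held-out set: keep training while the held-out loss improves, and stop once it has stalled for `patience` epochs. The model (a one-variable linear fit trained by gradient descent), the toy data, and every hyperparameter value below are assumptions for illustration only.

```python
import random

def held_out_loss(w, b, data):
    """Mean squared error of y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train_with_early_stopping(train, val, lr=0.1, max_epochs=5000,
                              patience=20, min_delta=1e-6):
    """Full-batch gradient descent that stops when held-out loss stalls.

    Returns (best_loss, best_epoch, w, b) for the best epoch seen.
    """
    w, b = 0.0, 0.0
    best = (float("inf"), 0, w, b)
    n = len(train)
    for epoch in range(1, max_epochs + 1):
        # One gradient step on the training mean squared error.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b

        loss = held_out_loss(w, b, val)
        if loss < best[0] - min_delta:
            best = (loss, epoch, w, b)      # meaningful improvement: remember it
        elif epoch - best[1] >= patience:
            break                           # no improvement for `patience` epochs
    return best

# Hypothetical toy data: y = 3x + 1 plus small Gaussian noise.
rng = random.Random(42)
points = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.1))
          for x in (rng.random() for _ in range(200))]
train, val = points[:150], points[150:]

best_loss, best_epoch, w, b = train_with_early_stopping(train, val)
print(f"stopped with best epoch {best_epoch}, held-out MSE {best_loss:.4f}")
```

Without `val` there is no signal to drive the `break`, which is the objection being raised here. Strictly speaking, the set used for this is usually called a validation set, since using it to pick the stopping point means it is no longer an unbiased test set.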