| HN Mirror



	by DougBTX 1069 days ago
	Here's another way to look at it. The test set is only an approximation of how the model will perform on production data; the model's actual performance is how it performs for real end users. So the real _actual_ results are always unknown until after the fact. Given that, if the training metrics clearly show that more data == better model, and there's no reason to expect that trend to reverse, then the logical thing to do is to maximise the data used for training to get the best results on actual production data. Doing this does complicate decisions about releasing subsequent model updates, as the production model can no longer be compared directly against new iterations on that test set. Instead, a pre-production model that has not seen the test set would need to be used for the comparison. However, if data drift is likely, then re-using the old test set wouldn't be useful anyway.
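	A minimal sketch of that workflow, assuming a toy one-dimensional regression problem: a pre-production model is fit on the training split only and scored on the held-out test set, then a production model is refit on everything. The synthetic data, the 160/40 split, and the names `fit_line`, `preprod_model`, and `prod_model` are illustrative assumptions, not anything from a real pipeline.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x

def mse(model, points):
    """Mean squared error of a fitted (a, b) line on (x, y) pairs."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)

# Hypothetical data: y = 2x + 1 plus unit-variance Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))
train, test = data[:160], data[160:]

# Pre-production model: never sees the test set, so its test error
# is an honest estimate that later iterations can be compared against.
preprod_model = fit_line(train)
preprod_test_mse = mse(preprod_model, test)

# Production model: refit on all available data. Its error on `test`
# is no longer an out-of-sample estimate, because `test` was trained on.
prod_model = fit_line(data)

print("pre-production test MSE:", round(preprod_test_mse, 3))
print("production (slope, intercept):", prod_model)
```

	The design choice is that the number you report comes from `preprod_model`, while the model you ship is `prod_model`; you are trusting the "more data == better model" trend rather than measuring the shipped model directly.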

2 comments

lumost 1069 days ago

Another way of thinking about it: if training on all the data yields a model that is functionally 5% better in online metrics, which would not be uncommon with a Pareto-distributed traffic pattern, then any subsequent partitioned model (one trained with a test set held out) would likely perform worse than the prod model.

A further complication arises when users expect things that previously worked one way to keep working that way. Users don't really care that their traffic was in the test set. In an even more extreme case, many industrial problems have a high correlation between today's traffic and next week's traffic. An optimal solution in such a situation would be to fully memorize today's traffic and use that for next week. In many cases, an overfit model can effectively perform this memorization task with fewer parameters and less infrastructure than an actual dictionary lookup.


nightski 1069 days ago

You act like training is this pre-set process you just "do". That's not the case: you train until you reach the desired performance on the test set. If you don't have a test set, how do you know when to stop training and avoid overfitting?
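One common form of this stopping rule is early stopping with patience, sketched below. It assumes you record a loss on held-out data after every epoch (in practice this is usually a validation split kept separate from the final test set). The function name `early_stopping_epoch`, the `patience` parameter, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best held-out loss seen before stopping.

    Training stops once `patience` epochs have passed without the
    held-out loss improving on the best value seen so far.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical held-out losses, one per epoch.
losses = [1.0, 0.7, 0.5, 0.45, 0.48, 0.55, 0.40]
stop_at = early_stopping_epoch(losses, patience=2)
print("keep the checkpoint from epoch", stop_at)
```

With `patience=2` the loop stops at epoch 5 and never sees the lower loss at epoch 6, which is the usual trade-off: a larger patience costs more training time but is less likely to stop on a temporary bump.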


baobabKoodaa 1069 days ago

You're confusing training epochs with dataset size.

I'm simplifying now, but you can think of epochs as "how many times we train over the entire dataset? 1 time? 10 times?"

Correspondingly, you can think of dataset size as "how many Wikipedia pages we include in the dataset? 1 million? 10 million?"

Now let's think about overfitting.

When you increase epochs, all else being equal, the model is more likely to overfit your data.

When you increase dataset size, all else being equal, the model is less likely to overfit your data.
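A rough sketch of the dataset-size half of that claim, assuming a deliberately memorising model (1-nearest-neighbour regression) on synthetic noisy `sin(x)` data. The model, the data, and the sizes 5 and 1000 are illustrative assumptions; the sketch does not cover the epochs half. Training error is zero at every size because each training point is its own nearest neighbour, so the train/test gap is a simple measure of how much the model overfits.

```python
import math
import random

def make_data(n, rng, noise=0.1):
    """Sample n noisy points from y = sin(x) on [0, 2*pi]."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        points.append((x, math.sin(x) + rng.gauss(0.0, noise)))
    return points

def predict_1nn(train, x):
    """Memorising model: return the y of the training point closest in x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(train, points):
    """Mean squared error of the 1-NN model built from `train` on `points`."""
    return sum((predict_1nn(train, x) - y) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
test = make_data(500, rng)

gaps = {}
for n in (5, 1000):
    train = make_data(n, rng)
    train_err = mse(train, train)  # zero: every point matches itself
    test_err = mse(train, test)
    gaps[n] = test_err - train_err  # generalisation gap
    print(n, "train MSE:", round(train_err, 4), "test MSE:", round(test_err, 4))
```

The same memorising model goes from a large train/test gap with 5 points to a small one with 1000, without any change to the model itself.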
