Hacker News
by baobabKoodaa 1070 days ago
I have personally never seen a situation where more training data (of similar quality) causes the model to perform worse. Have you seen such a situation? Please provide an example.

Your suggestion of running 1000 training runs with different subsets of data sounds excessive and unnecessary to me.

2 comments

You have to know when to stop training. How are you going to do that without a test set? How do you know when you have achieved generalization without over-fitting?
Early stopping is just one form of regularization. You can use L2 regularization or dropout instead, and then train until your model converges.
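To make that concrete, here's a minimal sketch of "regularize with L2 and train to convergence", assuming a toy one-feature linear model fit by plain gradient descent. The function name `train_until_converged`, the synthetic data, and all hyperparameters are made up for illustration. The stopping rule looks only at the penalized training loss, so no held-out set is needed to decide when to stop.

```python
import random

def train_until_converged(xs, ys, l2=0.1, lr=0.05, tol=1e-10, max_steps=100_000):
    """Fit y ~ w*x + b by gradient descent on MSE + l2 * w**2.

    Training stops when the penalized training loss stops improving,
    not when a validation metric turns around.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    prev_loss = float("inf")
    for step in range(1, max_steps + 1):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errs) / n + l2 * w * w
        if prev_loss - loss < tol:  # converged: loss no longer decreasing
            return w, b, step
        prev_loss = loss
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n + 2 * l2 * w
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_steps

# Synthetic data (assumed for illustration): y = 3x + 1 plus small noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]

w_plain, b_plain, steps_plain = train_until_converged(xs, ys, l2=0.0)
w_ridge, b_ridge, steps_ridge = train_until_converged(xs, ys, l2=0.1)
print(f"no penalty: w={w_plain:.3f} after {steps_plain} steps")
print(f"L2 penalty: w={w_ridge:.3f} after {steps_ridge} steps")
```

The penalty shrinks the weight toward zero, which is the regularizing effect. Dropout would play a similar role in a neural network, but it isn't shown here.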
Usually I develop models with a train/validation/test split, where I'm measuring results on the validation set to decide the appropriate number of epochs to use. Then I burn the test set to evaluate performance. Then I train from scratch on the entire dataset (no split) and I use the same number of epochs to train here. Is this number of epochs optimal when the dataset is different? Of course not. But when you use regularization and other methods to combat overfitting appropriately, your training is not going to be overly sensitive to changes in epoch number anyway.
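Roughly, that workflow looks like the sketch below. This is only an illustration on a toy linear model trained with per-sample SGD. The helper names (`train`, `mse`), the 120/40/40 split, the epoch range, and the synthetic data are all assumptions, not anything from a real project.

```python
import random

def train(data, epochs, lr=0.05, l2=0.01):
    """Per-sample SGD on squared error with a small L2 penalty on w."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = w * x + b - y
            w -= lr * (2 * err * x + 2 * l2 * w)
            b -= lr * 2 * err
    return w, b

def mse(model, data):
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Synthetic data (assumed for illustration): y = 3x + 1 plus small noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)))

train_set, val_set, test_set = data[:120], data[120:160], data[160:]

# 1. Pick the epoch count using the validation set only.
best_epochs = min(range(1, 31), key=lambda e: mse(train(train_set, e), val_set))

# 2. Burn the test set once to estimate performance.
test_mse = mse(train(train_set, best_epochs), test_set)

# 3. Retrain from scratch on everything, reusing the same epoch count.
final_model = train(data, best_epochs)

print(f"epochs={best_epochs}, test MSE={test_mse:.4f}, final w={final_model[0]:.3f}")
```

Note that there is no held-out estimate for `final_model` itself. The test score from step 2 stands in for it, on the assumption that the extra data doesn't hurt.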
In the case of fine-tuning, you can end up with catastrophic forgetting. Architecture can influence how performance scales with data, and adding data doesn’t always improve performance.