by alicez 4031 days ago
Don't be sad. I'm happy to update the blog post as needed. By overfitting, do you mean over-optimizing the results on the validation set? As I understand nested CV, it is only necessary if (1) the hold-out validation set is too small to be representative of the overall data distribution, or (2) the model training procedure itself is unstable and produces models with wildly varying results on the same dataset.

To prevent overfitting to the training data, one uses hold-out validation, CV, or early stopping during training.

To prevent overfitting of hyperparameters to a small validation dataset, or to mitigate the variance of the model training outcome, one can use nested CV.
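For anyone who wants to see the mechanics, here is a rough sketch of nested CV in plain Python. Everything in it is an assumption made up for illustration: the 1-D toy data, the k-NN classifier, the grid of k values, and all the function names. The point is only the structure: the inner loop picks the hyperparameter using the outer training part, and the outer loop scores the tuned model on data the tuning never saw.

```python
import random
import statistics

def toy_data(n, seed=0):
    """Hypothetical 1-D data: label is 1 when x plus Gaussian noise is positive."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = 1 if x + rng.gauss(0, 0.3) > 0 else 0
        data.append((x, y))
    return data

def make_folds(n, n_folds, rng):
    """Shuffle indices 0..n-1 and deal them into n_folds roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def split(data, fold):
    """Return (train, test), where test holds the points whose indices are in fold."""
    held = set(fold)
    train = [p for j, p in enumerate(data) if j not in held]
    test = [data[j] for j in fold]
    return train, test

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def tune_k(train, grid, n_folds, rng):
    """Inner loop: pick the k with the best mean CV accuracy on train only."""
    folds = make_folds(len(train), n_folds, rng)

    def inner_score(k):
        scores = []
        for fold in folds:
            inner_train, inner_val = split(train, fold)
            scores.append(accuracy(inner_train, inner_val, k))
        return statistics.mean(scores)

    return max(grid, key=inner_score)

def nested_cv(data, grid, outer_folds=5, inner_folds=3, seed=0):
    """Outer loop: tune on each training part, then score on the held-out fold."""
    rng = random.Random(seed)
    scores = []
    for fold in make_folds(len(data), outer_folds, rng):
        train, test = split(data, fold)
        best_k = tune_k(train, grid, inner_folds, rng)
        scores.append(accuracy(train, test, best_k))
    return scores

scores = nested_cv(toy_data(120), grid=[1, 5, 15])
print("outer-fold accuracies:", scores)
print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```

The spread of the outer-fold scores is a rough indication of how much the whole tune-then-fit procedure varies, which a single hold-out split cannot show.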

Is that along the lines of what you were looking for?

1 comments

It is a common misconception and a huge source of disappointment with ML -- without proper validation of the whole model building procedure (method selection + parameter tuning + feature selection + fitting), no amount of data or magic tricks can assure you that there is no overfitting. Even a single hold-out test is risky because it gives you no idea about the expected accuracy variance.
Well, you can use the bootstrap to calculate the variance. It costs computation. But it works. Cosma Shalizi wrote a really nice introduction to it: http://www.americanscientist.org/issues/pub/2010/3/the-boots...
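A minimal sketch of that idea, using only the Python standard library. The hold-out results (83 correct out of 100) are hypothetical numbers invented for the example, and the function name is mine. Note the assumption: this resamples the test-set outcomes only, so it estimates the variance due to the finite test set, not the variance from retraining the model on different data.

```python
import random
import statistics

def bootstrap_accuracy(correct, n_resamples=2000, seed=0):
    """Bootstrap the accuracy from per-example 0/1 correctness flags.

    Returns (mean, standard deviation, 95% percentile interval) of the
    resampled accuracy estimates.
    """
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)  # n draws with replacement
        estimates.append(sum(resample) / n)
    estimates.sort()
    low = estimates[n_resamples * 25 // 1000]
    high = estimates[n_resamples * 975 // 1000 - 1]
    return statistics.mean(estimates), statistics.stdev(estimates), (low, high)

# Hypothetical hold-out results: 83 correct predictions out of 100.
correct = [1] * 83 + [0] * 17
mean_acc, std_acc, interval = bootstrap_accuracy(correct)
print("accuracy:", mean_acc, "std:", std_acc, "95% interval:", interval)
```

The computational cost is one pass over the test results per resample, which is cheap here because the model is not refit. Bootstrapping the whole training procedure would mean retraining once per resample.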