by nonbel 3080 days ago
If you cross-validate (CV) on a dataset, then change the features (or hyperparameters) and cross-validate again, picking the best model, you will overfit to the CV. This is data leakage, and it will lead you to be overly optimistic about your model's performance on unseen data.

This is well known, and honestly it only takes working once with a real hold-out set (no cheating) to learn it for life. E.g.: https://datascience.stackexchange.com/questions/17288/why-k-...
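
For anyone who wants to see the effect rather than take it on faith, here is a rough sketch in plain Python. Everything in it is an assumption made for illustration: the labels are coin flips, every candidate "feature" is pure noise, the model is a toy one-feature nearest-centroid classifier, and the names (`run_experiment`, `cv_accuracy`, and so on) are made up. Because nothing here carries signal, the honest accuracy is about 50%, yet picking the best of many candidates by CV score should report something noticeably higher.

```python
import random


def fit_centroids(xs, ys):
    """Return [mean of x for class 0, mean of x for class 1]."""
    means = []
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        # Fall back to 0.0 if a class is missing from this training split.
        means.append(sum(vals) / len(vals) if vals else 0.0)
    return means


def accuracy(means, xs, ys):
    """Accuracy of the nearest-centroid rule on (xs, ys)."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 0 if abs(x - means[0]) <= abs(x - means[1]) else 1
        hits += int(pred == y)
    return hits / len(xs)


def cv_accuracy(xs, ys, k=5):
    """k-fold cross-validated accuracy for a single candidate feature."""
    n = len(xs)
    hits = 0.0
    for fold in range(k):
        val_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        val_x = [xs[i] for i in sorted(val_idx)]
        val_y = [ys[i] for i in sorted(val_idx)]
        means = fit_centroids(train_x, train_y)
        hits += accuracy(means, val_x, val_y) * len(val_x)
    return hits / n


def run_experiment(seed=0, n_select=60, n_holdout=2000, n_features=200):
    """Select the best of many noise features by CV, then score it on held-out data."""
    rng = random.Random(seed)
    ys = [rng.randint(0, 1) for _ in range(n_select)]
    hold_ys = [rng.randint(0, 1) for _ in range(n_holdout)]

    # Each candidate is an independent noise column: no real signal anywhere.
    select_cols = [[rng.gauss(0, 1) for _ in range(n_select)] for _ in range(n_features)]
    hold_cols = [[rng.gauss(0, 1) for _ in range(n_holdout)] for _ in range(n_features)]

    # "Change the features and CV again, picking the best model."
    cv_scores = [cv_accuracy(col, ys) for col in select_cols]
    best = max(range(n_features), key=lambda j: cv_scores[j])

    # Refit the winner on all selection data, then score on untouched data.
    means = fit_centroids(select_cols[best], ys)
    holdout_acc = accuracy(means, hold_cols[best], hold_ys)
    return cv_scores[best], holdout_acc


best_cv, holdout_acc = run_experiment()
print(f"best CV accuracy after selection: {best_cv:.3f}")
print(f"same model on held-out data:      {holdout_acc:.3f}")
```

The gap between the two printed numbers is the optimism bias from reusing the CV for model selection. Evaluating the selected model once on data that played no part in selection (or using nested CV) is the usual way to get an honest estimate.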

1 comment

The final performance evaluation does not use cross-validation, but uses totally held out validation data not used during model selection.
Thanks, this is not at all clear from the pre-print. From the final paper it does seem you are right, but the datasets and how each was used could probably be made clearer (e.g. by including a table with that info).