| HN Mirror



	by nonbel 3080 days ago
	If you cv on a dataset, then change the features (or hyperparameters) and cv again, picking the best model, you will overfit to the cv. This is a form of data leakage, and it will lead you to be overly optimistic about your model's performance on unseen data. This is well known, and honestly it only takes working once with a real hold-out set (no cheating) to learn it for life. E.g.: https://datascience.stackexchange.com/questions/17288/why-k-...
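	A toy sketch of the effect, not taken from the paper under discussion: the labels and every candidate feature below are pure coin flips, so no model can truly beat 50% accuracy. The names `cv_accuracy` and `run_demo`, the sample sizes, and the one-feature "model" are all assumptions made up for illustration. Picking the best of many candidates by CV score should still report an accuracy well above chance, while a hold-out set that was never used for selection should stay near 50%.

```python
import random


def cv_accuracy(x, y, k=5):
    """k-fold CV accuracy of a one-feature rule (hypothetical toy model).

    The rule predicts either x or 1 - x, whichever agrees with the labels
    more often on the training folds.
    """
    n = len(y)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # "Fit": decide on the training folds only whether to flip the feature.
        agree = sum(x[i] == y[i] for i in train_idx)
        flip = 2 * agree < len(train_idx)
        # Score on the fold that was left out.
        correct += sum((1 - x[i] if flip else x[i]) == y[i] for i in test_idx)
    return correct / n


def run_demo(n_select=100, n_holdout=1000, n_candidates=200, seed=0):
    rng = random.Random(seed)
    n = n_select + n_holdout
    # Labels and features are independent noise: true accuracy is 0.5.
    y = [rng.randint(0, 1) for _ in range(n)]
    features = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_candidates)]

    # Model selection: CV every candidate on the selection data, keep the best.
    scores = [cv_accuracy(f[:n_select], y[:n_select]) for f in features]
    best = max(range(n_candidates), key=scores.__getitem__)

    # Refit the winner on all selection data, then score it once on the hold-out.
    xs, ys = features[best][:n_select], y[:n_select]
    flip = 2 * sum(a == b for a, b in zip(xs, ys)) < n_select
    xh, yh = features[best][n_select:], y[n_select:]
    holdout = sum((1 - a if flip else a) == b for a, b in zip(xh, yh)) / n_holdout
    return scores[best], holdout


if __name__ == "__main__":
    best_cv, holdout = run_demo()
    print(f"best CV accuracy after selection: {best_cv:.3f}")
    print(f"hold-out accuracy of that model:  {holdout:.3f}")
```

	The gap between the two printed numbers is the optimism introduced by selecting on the CV score. Increasing `n_candidates` should widen it, since there are more chances for a noise feature to look good on the selection data.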

1 comments

michaelhoffman 3080 days ago

The final performance evaluation does not use cross-validation, but uses totally held out validation data not used during model selection.


nonbel 3079 days ago

Thanks, this is not at all clear from the pre-print. From the final paper it does seem you are right, but the datasets and how each was used could probably be a bit clearer (e.g. include a table with that info).
