Hacker News
by nonbel 3080 days ago
>"We found that adding these same features from the CFD model further boosted performance and so also included these. The final deployed model was trained only on the Avana data (combining with Gecko did not increase cross-validation performance)." https://www.biorxiv.org/content/early/2016/10/05/078253

Sounds like you leaked info from the training data into validation/test data, which will make you overfit and thus overstate the accuracy. I may have missed it, but did you evaluate the skill of this model on a holdout dataset?

EDIT:

This link doesn't appear to work:

>"All source code and a front-end website for the cloud service will be made available from http://research.microsoft.com/en-us/projects/crispr upon publication."

1 comments

No, there was no leakage. We trained on one dataset and evaluated on a completely different one, then did the reverse to show that the model generalized well irrespective of the training data (Figure 2). The decision of which model to deploy was based on cross-validation over the Avana data. We would have loved to have even more data, but generating data from this kind of experiment is expensive and labor-intensive.

EDIT: we will update the link, thanks. The correct link is https://www.microsoft.com/en-us/research/project/crispr/

If you cross-validate on a dataset, then change the features (or hyperparameters) and cross-validate again, picking the best model, you will overfit to the CV. This is data leakage, and it will make you overly optimistic about your model's performance on unseen data.
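To make the selection effect concrete, here is a minimal sketch using only the Python standard library. It is a toy, not the paper's pipeline: the labels are pure noise, each "candidate model" is a single random binary feature, and names like `run_demo` and `cv_score` are made up for illustration. Picking the candidate with the best CV score and then reporting that same CV score tends to look better than chance, while a holdout set that played no part in the selection should hover around 50%.

```python
import random


def fit_sign(feature, labels, idx):
    """Learn from the training rows whether to predict y = x or y = 1 - x."""
    agree = sum(1 for i in idx if feature[i] == labels[i])
    return agree * 2 >= len(idx)  # True: predict x, False: predict 1 - x


def accuracy(feature, labels, idx, same):
    """Fraction of rows in idx that the fitted rule gets right."""
    hits = sum(1 for i in idx if (feature[i] == labels[i]) == same)
    return hits / len(idx)


def cv_score(feature, labels, idx, k=5):
    """Plain k-fold CV: fit on k-1 folds, score on the remaining fold."""
    folds = [idx[j::k] for j in range(k)]
    scores = []
    for j in range(k):
        train = [i for f in folds[:j] + folds[j + 1:] for i in f]
        same = fit_sign(feature, labels, train)
        scores.append(accuracy(feature, labels, folds[j], same))
    return sum(scores) / k


def run_demo(seed=0, n=200, n_candidates=300):
    rng = random.Random(seed)
    # Noise labels: no candidate can truly do better than 50%.
    labels = [rng.randint(0, 1) for _ in range(n)]
    candidates = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(n_candidates)]

    select_idx = list(range(n // 2))      # used for CV and model selection
    holdout_idx = list(range(n // 2, n))  # never touched during selection

    cv_scores = [cv_score(c, labels, select_idx) for c in candidates]
    best = max(range(n_candidates), key=lambda c: cv_scores[c])

    # Refit the selected candidate on all selection data, score on the holdout.
    same = fit_sign(candidates[best], labels, select_idx)
    holdout_acc = accuracy(candidates[best], labels, holdout_idx, same)
    return cv_scores[best], holdout_acc, cv_scores


best_cv, holdout_acc, _ = run_demo()
print(f"best CV accuracy after selection: {best_cv:.2f}")
print(f"holdout accuracy of that model:   {holdout_acc:.2f}")
```

The gap between the two printed numbers is the optimism introduced by selecting on the CV score. It grows with the number of candidates tried and shrinks with more data.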

This is well known, and honestly it only takes working with a real holdout set once (no cheating) to learn it for life. E.g.: https://datascience.stackexchange.com/questions/17288/why-k-...

The final performance evaluation does not use cross-validation; it uses fully held-out validation data that was not used during model selection.

Thanks, this is not at all clear from the preprint. From the final paper it does seem you are right, but the datasets and how each was used could probably be a bit clearer (e.g., include a table with that info).