by zmjjmz 4578 days ago
>stop and think about possible sources of contamination

One great example from my Machine Learning professor was an assignment where we were required to normalize our data to [0,1]. After doing this and then going through the typical cross-validation cycle, he had us try to figure out where we had contaminated our validation sets. As it turns out, we all normalized our data before splitting it up, so the scaling was computed from training and testing data together and each set influenced the other.
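
A minimal sketch of the leak-free ordering, assuming plain min-max scaling over lists of floats. The function names (`fit_min_max`, `apply_min_max`, `split_then_normalize`) are made up for illustration and are not from the assignment: split first, fit the scaling on the training rows only, then reuse those same statistics on the held-out rows.

```python
import random


def fit_min_max(rows):
    """Return per-column (min, max) pairs computed from `rows` only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Scale each column using previously fitted (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            # Constant columns (hi == lo) are mapped to 0.0 to avoid dividing by zero.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, stats)
        ])
    return scaled


def split_then_normalize(rows, test_fraction=0.25, seed=0):
    """Shuffle and split first, then fit the scaling on the training part only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    stats = fit_min_max(train)  # the held-out rows never touch these statistics
    return apply_min_max(train, stats), apply_min_max(test, stats), stats
```

Held-out values can land slightly outside [0,1] this way, which is expected: it is the same thing that would happen to unseen data in production. For k-fold cross-validation the fit would be repeated inside every fold.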

It's a simple fix, but if you've done that and gone on to train a large convolutional neural network for a week, only to find that you made a stupid error like that, it can be pretty painful. (Especially since the bad generalization error might not be obvious until you use the model in production.)

2 comments

Maybe one could benefit from a sort of blinding procedure, where the person designing the learner is never allowed to even look at the validation data.
If both your training and testing datasets are representative of actual data, wouldn't the normalization function be nearly equivalent in both datasets?