| There's actually a case in the early history of perceptrons that brings up this exact issue: "There is a humorous story from the early days of machine learning about a network that was supposed to be trained to recognize tanks hidden in forest regions. The network was trained on a large set of photographs – some with tanks and some without tanks. After learning was complete the system appeared to work well when “shown” additional photographs from the original set. As a final test, a new group of photos were taken to see if the network could recognize tanks in a slightly different setting. The results were extremely disappointing. No one was sure why the network failed on this new group of photos. Eventually, someone noticed that in the original set of photos the network had been trained on, all of the photos with tanks had been taken on a cloudy day, while all of the photos without tanks were taken on a sunny day. The network had not learned to detect the difference between scenes with tanks and without tanks, it had instead learned to distinguish photos taken on cloudy days from photos taken on sunny days!"[0] The pragmatic answer is that this is why you have two hold-out sets: a cross-validation (CV, or dev) set and a test set. Typically you keep 70% of the data for training, 15% for CV, and 15% for test. Ideally you shuffle the data well enough that its natural order introduces no bias. You train the model on the training data and estimate how well it actually performs on the CV set, which the model did not see in training. You continue to use the CV set while you tweak parameters, try out new models, etc. At this point you may have "cheated" a bit, because you only kept things that worked well on your CV data. Finally, when you say "this is done!", you try out your model on the test set.
Of course, it's still possible that you would have the even/odd issue; the answer to this whole set of issues is "healthy skepticism" and checking for these types of errors. Take, for example, this Sentence Completion Challenge from Microsoft Research [1]. They claim some astounding results on correctly predicting GRE-type questions using a very simple model (LSA, for those who care). These results seemed impossible! But it turns out they cheated by training the model only on possible answers (which is akin to studying for the actual GRE by only reviewing the possible answers that will be on the exam). We tend to obsess over p-values and test validation scores as a substitute for reasoning. But all research papers should be read as an argument a friend is making to you, "I've done this incredible thing...", and no single number should replace reasoned inquiry into possible errors. [0] http://watson.latech.edu/WatsonRebootTest/ch14s2p4.html [1] http://research.microsoft.com/apps/pubs/?id=157031 |
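
For anyone who wants the 70/15/15 split above spelled out, here is a minimal sketch in plain Python. The function name `split_dataset`, its parameters, and the fixed seed are my own illustrative choices, not taken from either reference, and real projects often use a library helper instead.

```python
import random


def split_dataset(examples, train_frac=0.70, cv_frac=0.15, seed=0):
    """Shuffle the examples, then cut them into train / CV / test lists.

    Hypothetical helper for illustration: whatever is left after the
    train and CV slices becomes the test set.
    """
    shuffled = list(examples)
    # Shuffle a copy so the natural order of the data cannot leak into the split.
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = round(n * train_frac)
    n_cv = round(n * cv_frac)

    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]
    return train, cv, test


# Tune on `cv` as often as you like; touch `test` once, at the very end.
train, cv, test = split_dataset(range(100))
```

Note that shuffling only removes ordering bias. It would not have saved the tank network, because every split would still contain the cloudy/sunny confound, which is why the skepticism part still matters.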