> That paper shows that imagenet classifiers retain their ranking on data drawn from the same distribution as the training set. That isn't even close to the same thing as generalising to real world data.

It's not the same distribution; it's new data collected and processed according to the same recipe. That makes it a quite different distribution, as demonstrated by the fact that the accuracy numbers drop sharply. That's why it's so surprising that the rankings do not change that much. (Okay, in principle, a possible explanation is that it is the exact same distribution, with a fixed percentage of mislabeled or impossibly hard-to-label datapoints added. Appendix B2 of the paper deals with this possibility.) In any case, I fully agree that this kind of generalization is still much easier than generalizing to real-world data.

> Again, I don't actually think imagenet was as problematic as other competitions, but there is better evidence for that (not the least of which is that for the first half of imagenet's life, the differences in models were large, the number of tests was cumulatively fairly small, and the test set was huge: ie what I wrote supports imagenet as fairly reliable).

CIFAR-10 is basically the opposite of your list of requirements: small train set, small test set, public test set, small number of labels, grid-searched to death. And yet, look at the CIFAR-10 graph from that paper: it shows the exact same pattern as ImageNet.
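
To make the parenthetical hypothesis above concrete (same distribution, plus a fixed fraction of mislabeled or impossibly hard points), here is a minimal sketch. The model names, the accuracies, and the 10% fraction are made up for illustration and are not numbers from the paper; `mixed_accuracy` and `ranking` are hypothetical helpers, and the sketch assumes every model scores the same `acc_hard` on the added points.

```python
def mixed_accuracy(acc_clean, frac_hard, acc_hard=0.0):
    """Accuracy on a test set where a fraction `frac_hard` of the points
    are mislabeled or unanswerable (scored at `acc_hard` by every model)
    and the rest come from the original distribution."""
    return (1.0 - frac_hard) * acc_clean + frac_hard * acc_hard


def ranking(scores):
    """Model names sorted from best to worst accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical accuracies on the original test set (illustrative only).
original = {"model_a": 0.95, "model_b": 0.90, "model_c": 0.80}

# Same models on a hypothetical new test set with 10% hopeless points.
new = {name: mixed_accuracy(acc, frac_hard=0.1) for name, acc in original.items()}

print(ranking(original))
print(ranking(new))
```

Under that assumption the new accuracy is a linear, increasing function of the old one, so every model loses accuracy but the ordering cannot change. That is why this explanation would be consistent with a sharp drop and stable rankings at the same time.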