This is actually a really, really good article. I like the way the author writes both clearly and entertainingly about a reasonably complex topic.
The disconnect between static and interactive data analysis that is at the heart of the post is probably the most ignored issue in science.
To be honest, it's hard not to ignore it given its implications: we only get one shot at a set of test/validation/experimental data, and if we mess up, we're screwed.
So classifiers with no theory behind them (random classifiers), aggregated to optimize on a non-representative holdout set, form a theory about that set? I think this is expected. If you create classifiers that express some domain theory on the training set in step 1, and use the information in the holdout differently, you'll do a lot better (I believe; well, I think I saw that result when I did my Ph.D. 17 years ago).
Here is a very bad, very bad, very old, very old AAAI workshop paper that sums up the idea (the journal paper is behind a paywall).
This paper [1] by Bergstra and Cox is one of my favorites on competing without looking at the data. They were actually able to design the model before the data was even released (!)
This article illustrates the problem with over-fitting a model even when some data is withheld for testing. This is a trap that one can fall into when using training and testing sets.
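For anyone who wants to see the trap in a few lines, here is a rough sketch, not the article's exact procedure. It assumes the labels are pure coin flips, so no classifier can really beat 50%, and that each "classifier" is just a fixed random guess per example. The function name `holdout_overfit_demo` and all its parameters are made up for illustration. The only thing the procedure does is keep the guessers that happened to score above chance on the holdout and take a majority vote.

```python
import random


def holdout_overfit_demo(n_holdout=200, n_fresh=200, n_guessers=500, seed=0):
    """Aggregate random guessers that were selected by their holdout score.

    Labels are coin flips, so any accuracy above 50% is an artifact of
    reusing the holdout for selection. All names here are hypothetical.
    """
    rng = random.Random(seed)
    n_total = n_holdout + n_fresh
    labels = [rng.randint(0, 1) for _ in range(n_total)]

    # Each "classifier" is a fixed random guess for every example (no theory).
    guessers = [
        [rng.randint(0, 1) for _ in range(n_total)] for _ in range(n_guessers)
    ]

    def accuracy(preds, lo, hi):
        hits = sum(p == y for p, y in zip(preds[lo:hi], labels[lo:hi]))
        return hits / (hi - lo)

    # The leak: keep only guessers that beat chance on the holdout.
    kept = [g for g in guessers if accuracy(g, 0, n_holdout) > 0.5]

    # Majority vote of the kept guessers (ties go to class 0).
    vote = [int(2 * sum(g[i] for g in kept) > len(kept)) for i in range(n_total)]

    # Score the vote on the reused holdout and on examples never used for selection.
    return accuracy(vote, 0, n_holdout), accuracy(vote, n_holdout, n_total)


holdout_acc, fresh_acc = holdout_overfit_demo()
print(f"holdout accuracy: {holdout_acc:.2f}")
print(f"fresh accuracy:   {fresh_acc:.2f}")
```

Under these assumptions the vote should score well above chance on the holdout and close to 50% on the fresh examples, because the holdout was used to pick the guessers and is no longer an independent test.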