
What I'm pointing out above is that everyone games the benchmarks in the way that you say, by tuning their models until they do well on the test set. They train, they test, and they iterate until they get it right. At that point any results are meaningless for the purpose of estimating generalisation, because the models are effectively overfit to the test set without ever having been trained on it directly.
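To make that concrete, here's a toy simulation. It's a sketch under loud assumptions, not anyone's actual benchmark: the labels are coin flips, and the "models" (the hypothetical `guess` function) have no skill at all, so nothing can be learned. All names and parameters are made up for illustration. The only thing it does is repeat "test, iterate, keep the best" against a fixed test set, and then check the winner on fresh data.

```python
import random


def guess(model_id, split, n):
    """A hypothetical 'model' with no skill: it guesses labels at random,
    deterministically given its id and the name of the data split."""
    rng = random.Random(f"model-{model_id}-{split}")
    return [rng.randint(0, 1) for _ in range(n)]


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def simulate(n_models=200, n_test=100, n_fresh=10_000, seed=0):
    rng = random.Random(seed)
    # Labels are pure noise, so true accuracy of any model is 50%.
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]

    # "Train, test, iterate": keep whichever variant scores best on the
    # test set. No variant ever trains on the test labels.
    best_id = max(
        range(n_models),
        key=lambda m: accuracy(guess(m, "test", n_test), test_labels),
    )

    reported = accuracy(guess(best_id, "test", n_test), test_labels)
    actual = accuracy(guess(best_id, "fresh", n_fresh), fresh_labels)
    return reported, actual


if __name__ == "__main__":
    reported, actual = simulate()
    print(f"test-set accuracy of the selected model: {reported:.2f}")
    print(f"accuracy of the same model on fresh data: {actual:.2f}")
```

The selected model's test-set score comes out clearly above chance, while its score on fresh data sits at roughly 50%. The gap is purely an artefact of picking the winner using the test set, which is the same selection effect, in miniature.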

And this is standard practice: everyone does it all the time. I believe a sizeable majority of researchers don't even understand that what they're doing is pointless, because it's what they've been taught to do, both by looking at each other's work and by following what their supervisors tell them.

Btw, we don't really care about generalisation on the test set, per se. The point of testing on a held-out test set is that it's supposed to give you an estimate of a model's generalisation on truly unseen data, i.e. data that was not available to the researchers during training. That's the generalisation we're really interested in. And the reason we're interested in that is that if we deploy a model in a real-world situation (rare as that may be) it will have to deal with unseen data, not with the training data, nor with the test data.