
You are right of course that it shouldn't, but in practice it always is.

The pragmatic reason is that if a team spends a month and a few thousand dollars developing and tuning a system, and then finds that it doesn't work the first time they test it on their held-out test set, there is no way they'll just drop it and accept that all their effort and funds went to waste. They'll keep trying until their system returns the best results on the test set, at which point they've overfitted their system to the test set.
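To make that concrete, here's a toy simulation (my own sketch, not anything from the paper). The assumptions are all mine: binary labels that are pure coin flips, "models" that are also coin flippers with no skill at all, and a hypothetical function name `simulate_test_set_selection`. It keeps "trying again" against the same test set, keeps the best attempt, and then scores that attempt on fresh data.

```python
import random


def simulate_test_set_selection(n_attempts=200, n_test=100, n_fresh=10_000, seed=0):
    """Pick the best of many skill-free 'models' by test-set accuracy.

    Returns (best accuracy on the reused test set, accuracy of that same
    model on fresh data it was never selected on).
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]

    def accuracy(model_rng, labels):
        # A "model" here is a coin flipper, so its true accuracy is 0.5.
        return sum(model_rng.randint(0, 1) == y for y in labels) / len(labels)

    best_acc, best_model = -1.0, None
    for _ in range(n_attempts):
        # Each attempt stands in for one more round of tuning.
        model_rng = random.Random(rng.getrandbits(32))
        acc = accuracy(model_rng, test_labels)
        if acc > best_acc:
            best_acc, best_model = acc, model_rng

    # The selected model keeps guessing at random on data it has not seen.
    return best_acc, accuracy(best_model, fresh_labels)


if __name__ == "__main__":
    test_acc, fresh_acc = simulate_test_set_selection()
    print(f"best accuracy on the reused test set: {test_acc:.2f}")
    print(f"same model on fresh data:             {fresh_acc:.2f}")
```

Nothing here has learned anything, yet the number reported on the reused test set should come out clearly above chance, while the same model on fresh data stays around 0.5. The gap is purely the result of selecting on the test set.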

It's rare that this is described as clearly as in the paper I quote from, but I think that's because the authors of the paper are not machine learning people. In most machine learning papers you really have to read between the lines to be sure what's been done.

Edit: btw, this is the paper I quoted from, linked by the author upthread:

https://arxiv.org/abs/2108.00275

I should have posted my comment under theirs but I got distracted and posted it on top instead.