by wannesm
1557 days ago
I agree this is not how you should evaluate. But it is not how things are done in (correctly executed) ML. What is described is a validation set, not a test set. The test set should not be involved in training or (hyper)parameter optimization at any time.
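To make the distinction concrete, here is a minimal sketch of a three-way split. Everything in it is made up for illustration: the synthetic data, the toy k-NN classifier, and the names `make_data`, `knn_predict` and `accuracy`. The point is only that the choice of `k` looks at the training and validation data, and the test set is scored once at the very end.

```python
import random

def make_data(n, rng):
    """Synthetic 1-D points: label is 1 when x > 0, with 20% label noise (illustrative assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = int(x > 0)
        if rng.random() < 0.2:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def accuracy(train, data, k):
    """Fraction of points in `data` that k-NN (fitted on `train`) labels correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(600, rng)
train, val, test = data[:400], data[400:500], data[500:]

# Hyperparameter search touches only the training and validation sets.
candidates = [1, 3, 5, 9, 15, 25]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))

# The test set is used exactly once, after every choice is frozen.
test_acc = accuracy(train, test, best_k)
print(f"best_k={best_k}, test accuracy={test_acc:.2f}")
```

If the test accuracy disappoints and you go back to change `candidates`, the test set has effectively become a second validation set.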
|
The pragmatic reason for that is that if a team spends a month and a few thousand dollars developing and tuning a system, and then they find that the first time they test it on their held-out test set it doesn't work, there is no way that they'll just drop it and accept that all their effort and funds went to waste. They'll just keep trying until their system returns the best results on the test set. At which point they've overfitted their system to the test set.
It's rare that this is described as clearly as in the paper I quote from, but I think that's because the authors of the paper are not machine learning people. In most machine learning papers you really have to squint between the lines to be sure what's been done.
Edit: btw, this is the paper I quoted from, linked by the author upthread:
https://arxiv.org/abs/2108.00275
I should have posted my comment under theirs but I got distracted and posted it on top instead.
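For anyone who wants to see the "keep trying until the test set looks good" effect in numbers, here is a toy simulation, not anything taken from the paper. The labels are coin flips, so no system can really beat 50%. Each "attempt" is just a random guesser (the `score` helper and all the sizes are assumptions I picked for illustration), and we keep whichever attempt scores best on the test set.

```python
import random

# Seed for the labels is chosen to differ from every guesser seed below,
# otherwise a guesser would replay the label sequence exactly.
label_rng = random.Random(10_000)
n_test, n_fresh, n_attempts = 100, 10_000, 200

# Labels are coin flips, so true accuracy of any system is 50%.
test_labels = [label_rng.randint(0, 1) for _ in range(n_test)]
fresh_labels = [label_rng.randint(0, 1) for _ in range(n_fresh)]

def score(seed, labels):
    """Accuracy of a 'system' that guesses at random; the seed fixes its guesses."""
    guesser = random.Random(seed)
    return sum(guesser.randint(0, 1) == y for y in labels) / len(labels)

# "Keep trying until it works": pick the attempt that looks best on the test set.
best_seed = max(range(n_attempts), key=lambda s: score(s, test_labels))

best_on_test = score(best_seed, test_labels)    # inflated by selection
best_on_fresh = score(best_seed, fresh_labels)  # data the selection never saw

print(f"selected system on test set:   {best_on_test:.2f}")
print(f"selected system on fresh data: {best_on_fresh:.2f}")
```

The selected system looks better than chance on the test set purely because it was selected on it. On fresh data it drops back to roughly 50%.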