| > And that suffices to explain its performance, without the need for any mysterious post-overfitting grokking ability. It actually still does not suffice. The effect is simply not expected, whatever the authors were doing. Just the fact that they managed to get that effect is interesting. Granted, the phenomenon may be limited in scope. For example, on ImageNet it may require ridiculously long time scales. But maybe there is some underlying reason we can exploit to get to grokking faster. It's basically all in Fig. 2: - they use 3 random seeds per result - they show results for 12 different simple algorithmic datasets - they evaluate 12 different combinations of hyperparameters - for each hyperparameter combination they use 10+ different ratios of train to validation splits So they do some 10*12*3*2 = 720 runs. They conclude that hyperparameters are important. Weight decay seems especially important for the grokking phenomenon to happen when the model has access to a low ratio of training data. Also, at least 2 other people managed to replicate those results: https://twitter.com/sea_snell/status/1461344037504380931 https://twitter.com/lieberum_t/status/1480779426535288834 One hypothesis may be that models are just biased to randomly stumble upon wide, flat local minima. And wide, flat local minima generalize well. |
You are impressed by the fact that one particular, counter-intuitive result was obtained, but of course there is an incentive to publish something that stands out, rather than something less notable. There is a well-known paper by John Ioannidis on cognitive biases in medical research:
Why most published research findings are false
https://journals.plos.org/plosmedicine/article?id=10.1371/jo...
It's not about machine learning per se, but its observations can be applied to any field where empirical studies are common, like machine learning.
Especially in the field of deep learning, where scholarly work tends to be primarily empirical and where understanding the behaviour of systems is impeded by the black-box nature of deep learning models, observing something mysterious and unexpected must be cause for suspicion and scrutiny of the methodology, rather than being accepted unconditionally as an actual observation. In particular, any hypothesis that tends towards magick, for example one suggesting that a change in quantities (data, compute, training time) yields qualitative improvements (prediction transmogrifying into understanding, overfitting transforming into generalisation), should be discarded with extreme prejudice.