|
|
|
|
|
by YeGoblynQueenne
1595 days ago
|
|
>> The point of the paper is to show that NN can still learn long after fully memorizing the train dataset. That is the claim in the paper. I don't understand how it is supported by measuring results on the validation set. Figure 3 looks nice but it doesn't say anything on its own. I don't know what's the best way to interpret it. The paper offers some interpretation that convinces you, but not me. Sorry, this kind of work is too fuzzy for me. What happened to good, old-fasion proofs? |
|
Everyone's expectation would be that this is it. The model is overfitted, so it is useless. The model is as good as a hash map, 0 generalization ability.
The paper provides empirical, factual evidence that as you continue training there is still something happening in the model. After the model memorized the whole training dataset and while it still has not received any feedback information from the validation dataset, it starts to figure out how to solve validation dataset.
Mind you, this is not interpretation, this is factual. Long after 100% overfitting, the model is able to keep increasing its accuracy on dataset it has not seen.
It's as we discovered that water can flow upwards.
Grokking was discovered by someone forgetting to turn off their computer.
Nobody knows why. So, nobody is able to make any theoretical deductions about it.
But I agree that fig 3. requires interpretation. By itself it does not say a lot, but similar structures appear in other models like in the one where we discuss elements sequence prediction. To me, the models figure out some underlying structure of the problem, and we are able to interpret that structure.
I tend to look at it from Bayesian perspective. This type of evidence increases my belief that the models are learning what I would call semantics. It's a separate line of evidence from looking at benchmark results. Here we can get a glimpse at how some models may be doing some simple predictions and it does not look like memorization.