by yorwba 50 days ago
To get pure grokking, you need a model large enough to easily memorize the entire training data, and you need to keep training for a long time after memorization. In practice, you'll probably use a more realistically sized model that might grok on some subset of the data, but not so strongly that it's extremely obvious.

I think I've trained models with #params >> #training examples for hundreds of epochs, but I still don't recall seeing that loss curve on real data. Curious if others have seen it with larger models or much longer runs.
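
One way to check old runs for this pattern instead of eyeballing the curve: measure how many epochs pass between the model fitting the training set and the validation accuracy catching up. This is only a rough sketch. The function names and the 0.99 threshold are my own assumptions, not a standard definition of grokking, and it expects you to have logged per-epoch train and validation accuracy.

```python
from typing import Optional, Sequence


def first_epoch_at_or_above(values: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first epoch index whose value reaches the threshold, or None."""
    for epoch, value in enumerate(values):
        if value >= threshold:
            return epoch
    return None


def grokking_delay(
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "memorized" / "generalized"
) -> Optional[int]:
    """Epochs between memorization and generalization.

    Returns None if either curve never reaches the threshold, i.e. the run
    never memorized or never generalized within the logged epochs.
    """
    memorized = first_epoch_at_or_above(train_acc, threshold)
    generalized = first_epoch_at_or_above(val_acc, threshold)
    if memorized is None or generalized is None:
        return None
    return generalized - memorized


# Made-up curves purely to show the call, not results from any real run.
toy_train = [0.50, 1.00, 1.00, 1.00, 1.00]
toy_val = [0.10, 0.10, 0.10, 0.20, 1.00]
print(grokking_delay(toy_train, toy_val))
```

A large positive delay relative to the memorization epoch would be the "pure" case described above. A delay near zero means the model generalized about as fast as it fit the training set, which is what most real-data curves look like.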