HN Mirror



	by yorwba 50 days ago
	To get pure grokking, you need a model large enough to easily memorize the entire training set, and you need to keep training for a long time after it has memorized. In practice, you'll probably use a more realistically sized model that might grok on some subset of the data, but not so strongly that it's obvious from the loss curve.

1 comment

hashta 50 days ago

I think I've trained models with #params >> #training examples for hundreds of epochs, but I still don't recall seeing that loss curve on real data. Curious whether others have seen it with larger models or much longer runs.
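For anyone who wants to check their own logs for that curve, here is a minimal sketch of one way to measure it, assuming you have per-epoch train and validation accuracy as plain lists. The helper names (`first_step_reaching`, `grokking_gap`) and the 0.99 threshold are made up for illustration, and the curves in the example are toy numbers, not results from a real run.

```python
from typing import Optional, Sequence


def first_step_reaching(values: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first index where values[i] >= threshold, or None if never."""
    for step, value in enumerate(values):
        if value >= threshold:
            return step
    return None


def grokking_gap(
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff for "memorized" / "generalized"
) -> Optional[int]:
    """Logged steps between memorization (train) and generalization (val).

    Returns None if either curve never reaches the threshold.
    """
    memorized = first_step_reaching(train_acc, threshold)
    generalized = first_step_reaching(val_acc, threshold)
    if memorized is None or generalized is None:
        return None
    return generalized - memorized


if __name__ == "__main__":
    # Toy curves for illustration only: train saturates early, val jumps late.
    toy_train = [0.2, 0.6, 1.0, 1.0, 1.0, 1.0]
    toy_val = [0.1, 0.2, 0.2, 0.3, 0.5, 1.0]
    print(grokking_gap(toy_train, toy_val))
```

A large positive gap relative to the time it took to memorize is the pattern being described above. A gap near zero, or `None` because validation never gets there, would match not seeing it on real data.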
