| HN Mirror



	by andrewprock 1925 days ago
	How much data do you need to mitigate the risk of overfitting a trillion-parameter model?


gwern 1925 days ago

Ideally, you need roughly 500GB of text. EleutherAI's The Pile was designed to be just big enough to fit a 1T-parameter GPT efficiently, and you can get the various scaling curves out of the OpenAI-related scaling papers. (You want the amount of data that fits into a single epoch, because if you reuse data, you get less bang for the FLOPs buck, and FLOPs constraints are right now much more binding than data or model size.)


andrewprock 1924 days ago

This feels off by a couple of orders of magnitude, unless a significant number of the parameters are not independent.
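To put rough numbers on that gap, here is a back-of-envelope sketch. The ~4 bytes of text per token figure is an assumption (a common rule of thumb, not something stated in this thread), and the helper name is purely illustrative.

```python
# Back-of-envelope: how many training tokens per model parameter?
# ASSUMPTION: ~4 bytes of text per token (rough rule of thumb, not from the thread).
BYTES_PER_TOKEN = 4


def tokens_per_parameter(text_bytes, n_params, bytes_per_token=BYTES_PER_TOKEN):
    """Return the number of training tokens available per model parameter."""
    n_tokens = text_bytes / bytes_per_token
    return n_tokens / n_params


text_bytes = 500e9   # ~500GB of text, the figure quoted above
n_params = 1e12      # a trillion-parameter model

ratio = tokens_per_parameter(text_bytes, n_params)
print(f"{ratio:.3f} tokens per parameter")
```

Under that assumption there is well under one token per parameter, which is the regime classical intuition (many observations per free parameter) would call hopeless.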


singhrac 1924 days ago

Well, that's the "magic" of modern deep learning. You can fit models with p > n (more parameters than data points) somehow without overfitting. In some areas you might find this called "the strong inductive bias of neural networks" or "double descent", but no one has found an explanation that I find convincing.


gwern 1924 days ago

It's quite amusing. The standard statistical theory does not work at all in estimating data vs model size, and the bounds are all vacuously large. It's a very active area of research, understanding why models act so simple when overparameterized and coming up with real measures of model complexity. Lots to read there if you are interested in such things.


andrewprock 1924 days ago

That just means that the parameters are not independent.


gwern 1924 days ago

But you can fit randomly-generated labels!


andrewprock 1922 days ago

That's not in any way surprising. When you have more parameters than data, this is trivial.
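A minimal sketch of the trivial case, using a plain linear model rather than a neural network (so it only illustrates the p > n counting argument, not anything about deep nets). The sizes, seed, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 200                        # fewer data points than parameters (p > n)
X = rng.normal(size=(n, p))           # random features
y = rng.choice([-1.0, 1.0], size=n)   # randomly generated labels, unrelated to X

# For an underdetermined system, lstsq returns the minimum-norm solution,
# which interpolates the training labels when X has full row rank.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.max(np.abs(X @ w - y))
train_acc = np.mean(np.sign(X @ w) == y)
print(f"max training residual: {train_error:.2e}")
print(f"training accuracy:     {train_acc:.2f}")
```

With more unknowns than equations, an exact fit to arbitrary labels exists almost surely, so memorizing noise by itself says little; the open question in the thread is why such models still generalize on real data.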
