| HN Mirror



	by janalsncm 7 days ago
	It’s a data point. I could imagine that in a hardware-constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture. I agree that this isn’t proof that it scales to trillions of tokens, but it does show that a scaled-up experiment would be worth a shot.

1 comment

Philpax 7 days ago

The Chinchilla scaling laws give you a minimum number of tokens you should train on for a given model size. If you can't meet what they suggest for that size, you should shrink the model; otherwise, its capacity is going to waste.

I do agree that it is a data point, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.
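
For a rough sense of the arithmetic behind that argument, here is a minimal sketch. It assumes the commonly quoted approximation of about 20 training tokens per parameter for the Chinchilla compute-optimal ratio; that constant is a rule of thumb, not an exact law. The function names and the example model size and token budget are hypothetical and are not taken from the work under discussion.

```python
# Assumption: ~20 training tokens per parameter, the commonly quoted
# approximation of the Chinchilla compute-optimal ratio (a heuristic).
TOKENS_PER_PARAM = 20


def suggested_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Token count the rule of thumb suggests for a model with n_params parameters."""
    return n_params * tokens_per_param


def max_params_for_budget(n_tokens: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Largest model size the rule of thumb supports for a fixed token budget."""
    return n_tokens / tokens_per_param


def is_undertrained(n_params: float, n_tokens: float,
                    tokens_per_param: float = TOKENS_PER_PARAM) -> bool:
    """True if the token budget falls short of the suggested count."""
    return n_tokens < suggested_tokens(n_params, tokens_per_param)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to illustrate the calculation.
    n_params = 1_000_000_000   # 1B-parameter model
    n_tokens = 5_000_000_000   # 5B-token training budget

    print(f"suggested tokens: {suggested_tokens(n_params):.3g}")
    print(f"undertrained:     {is_undertrained(n_params, n_tokens)}")
    print(f"model size this budget supports: {max_params_for_budget(n_tokens):.3g}")
```

Under this heuristic, a model trained on fewer tokens than `suggested_tokens` returns is the case where shrinking the model, as suggested above, would use the same data more effectively.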
