
	by _0ffh 147 days ago
	You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models otoh indeed keep profiting for much longer, fwiw.


Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.

Yup, as soon as data is the bottleneck and not compute, diffusion wins. Tested following the Chinchilla scaling strategy from 7M to 2.5B parameters.
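For a rough sense of what "data is the bottleneck" means across that 7M to 2.5B range, here is a back-of-the-envelope sketch. It assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter and the usual approximation of 6 * N * D training FLOPs; neither number comes from this thread, and the 10B-token dataset size is purely hypothetical.

```python
# Back-of-the-envelope Chinchilla-style budget.
# ASSUMPTIONS (not from the thread): ~20 tokens per parameter is the
# commonly quoted compute-optimal ratio, and training cost is roughly
# 6 * params * tokens FLOPs.
TOKENS_PER_PARAM = 20


def chinchilla_budget(n_params: int) -> tuple[int, int]:
    """Return (compute-optimal tokens, approximate training FLOPs)."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = 6 * n_params * tokens
    return tokens, flops


def epochs_needed(n_params: int, unique_tokens: int) -> float:
    """How many passes over a fixed dataset the token budget implies."""
    tokens, _ = chinchilla_budget(n_params)
    return tokens / unique_tokens


if __name__ == "__main__":
    hypothetical_dataset = 10_000_000_000  # 10B unique tokens, made up
    for n in (7_000_000, 2_500_000_000):
        tokens, flops = chinchilla_budget(n)
        epochs = epochs_needed(n, hypothetical_dataset)
        print(f"{n:>13,} params: {tokens:>14,} tokens, "
              f"{flops:.2e} FLOPs, {epochs:.3f} epochs")
```

Under these assumptions the 2.5B model wants about 50B tokens, so with only 10B unique tokens it would have to repeat the data about five times. That multi-epoch regime is the one the comments above are talking about.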