by _0ffh 147 days ago
You'd be surprised how quickly improvement of autoregressive language models levels off with epoch count (though, admittedly, one epoch is a LOT). Diffusion language models otoh indeed keep profiting for much longer, fwiw.

Does this also apply to LLM training at scale? I would be a bit surprised if it does, fwiw.
Yup, as soon as data is the bottleneck and not compute, diffusion wins. Tested following the Chinchilla scaling strategy from 7M to 2.5B parameters.

https://arxiv.org/abs/2507.15857
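
For a rough sense of what "data is the bottleneck" means over that 7M to 2.5B range, here is a back-of-the-envelope sketch. It assumes the commonly quoted Chinchilla rules of thumb (about 20 training tokens per parameter, about 6 FLOPs per parameter per token). Those constants, the function names, and the 10B-token corpus are illustrative assumptions, not numbers taken from the linked paper.

```python
# Rough Chinchilla-style arithmetic. The constants are rule-of-thumb
# assumptions, not values taken from the linked paper.
TOKENS_PER_PARAM = 20       # approx. compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # approx. training FLOPs per parameter per token


def chinchilla_budget(n_params: float) -> tuple[float, float]:
    """Return (tokens, training FLOPs) for a compute-optimal run."""
    tokens = TOKENS_PER_PARAM * n_params
    flops = FLOPS_PER_PARAM_TOKEN * n_params * tokens
    return tokens, flops


def epochs_needed(n_params: float, unique_tokens: float) -> float:
    """Passes over a fixed corpus needed to reach the compute-optimal token count."""
    tokens, _ = chinchilla_budget(n_params)
    return tokens / unique_tokens


if __name__ == "__main__":
    # The two ends of the parameter range mentioned in the comment above.
    for n in (7e6, 2.5e9):
        tokens, flops = chinchilla_budget(n)
        print(f"{n:.1e} params: {tokens:.1e} tokens, {flops:.1e} FLOPs")

    # Hypothetical corpus with 10B unique tokens (an assumption for illustration).
    print(f"epochs for 2.5B params: {epochs_needed(2.5e9, 10e9):.1f}")
```

Under these assumptions a 2.5B model wants about 50B tokens, so a 10B-token corpus already means roughly 5 passes over the same data. That multi-epoch regime is the one the comments above are talking about.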