| HN Mirror



	by linolevan 106 days ago
	There was this very interesting paper out of Stanford this past September about pretraining in the unlimited-compute, limited-data regime[0]. Pretty much exactly the same thing but with ~200M training tokens instead. [0] https://www.alphaxiv.org/abs/2509.14786


sdpmas 106 days ago

yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.


_0ffh 106 days ago

I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.


sdpmas 106 days ago

diffusion is promising, but it's still an open question how data-efficient it is compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.


_0ffh 106 days ago

Yes, it could go either way of course.

Still, just for reference, here's the paper I remembered: https://arxiv.org/pdf/2507.15857


sdpmas 106 days ago

thanks, here's another one: https://arxiv.org/abs/2511.03276
