

	by laidoffamazon 343 days ago
	Isn't the new trend to train in lower precision anyway?

2 comments

neilmovva 343 days ago

Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes, this works, especially with MXFP8 as supported on Blackwell [0].
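To make the accumulation point concrete, here is a minimal NumPy sketch. It rests on an assumption: NumPy has no FP8 dtype, so float16 stands in for the low-precision format, and the helper name `dot` is made up for illustration. It is not how the hardware computes, only a demonstration of what a narrow accumulator does to a long sum.

```python
import numpy as np

def dot(a, b, acc_dtype):
    """Dot product of low-precision inputs with a chosen accumulator dtype."""
    acc = acc_dtype(0.0)
    for x, y in zip(a, b):
        # Multiply and accumulate in acc_dtype; the result is rounded
        # back to acc_dtype after every step.
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return float(acc)

# Assumption: float16 is a stand-in for FP8, which NumPy does not provide.
a = np.ones(4096, dtype=np.float16)
b = np.ones(4096, dtype=np.float16)

low = dot(a, b, np.float16)   # narrow accumulator
high = dot(a, b, np.float32)  # FP32 accumulator
print(low, high)  # 2048.0 4096.0
```

The float16 accumulator stalls at 2048 because the spacing between representable values there is 2, so adding 1 rounds back to 2048. The FP32 accumulator reaches the exact answer, 4096.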

What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on the 5090.

[0]: https://arxiv.org/abs/2506.08027

[1]: https://arxiv.org/abs/2502.20586
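For readers unfamiliar with the MX formats, here is a rough sketch of MXFP4-style block quantization, based on my reading of the OCP Microscaling spec: 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). The function name and the nearest-value rounding are assumptions for illustration. It does not attempt the more involved techniques that stable MXFP4 training needs.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2   # largest E2M1 exponent: 6 = 1.5 * 2**2
BLOCK = 32      # elements per shared scale

def quantize_block(x):
    """Hypothetical helper: return (shared scale, FP4-grid values) for one block."""
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return 1.0, np.zeros_like(x)
    # Shared power-of-two scale chosen from the block's largest magnitude.
    scale = 2.0 ** (np.floor(np.log2(amax)) - E2M1_EMAX)
    scaled = np.minimum(np.abs(x) / scale, E2M1_GRID[-1])  # clamp to 6
    # Round each element to the nearest grid value (simple rounding, an assumption).
    idx = np.argmin(np.abs(scaled[:, None] - E2M1_GRID[None, :]), axis=1)
    return scale, np.sign(x) * E2M1_GRID[idx]

x = np.linspace(-1.0, 1.0, BLOCK)
scale, q = quantize_block(x)
deq = q * scale  # dequantized values
print(scale, np.max(np.abs(deq - x)))
```

With only eight magnitudes per element, the rounding error is large compared with FP8, which is part of why the FP4 pathway is harder to train stably.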

laidoffamazon 343 days ago

Interesting. My assumption was that one of the innovations of DeepSeek and the modern GPT models was doing pretraining in low precision, rather than just finetuning. I didn't realize you still need accumulation at a higher precision anyway.

storus 343 days ago

Only GPU-poors run Q-GaLore and similar tricks.

Twirrim 342 days ago

Even the large cloud AI services are focusing on this too, because it drives down the average "cost per query", or whatever you want to call it. For inference, arguably even more than for training, the smaller and more efficient they can make the model, the better for their bottom line.

storus 342 days ago

For inference, of course; the OP I replied to mentioned training, though.
