by neilmovva
295 days ago
Today, training in "low precision" probably means computing FP8 x FP8 -> FP32. The FP32 accumulation is still important, but otherwise yes, this works, especially if we're talking about MXFP8 as supported on Blackwell [0]. What's less proven is a recipe using MXFP4 x MXFP4 -> FP32 compute, e.g. [1], which needs more involved techniques to work. But if you get it to work stably, that pathway runs at full throughput on a 5090.

[0]: https://arxiv.org/abs/2506.08027
[1]: https://arxiv.org/abs/2502.20586
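
For intuition, here is a rough pure-Python sketch of the "MXFP8 x MXFP8 -> wide accumulator" pattern. It is a numerical simulation only, not how Blackwell or any library implements it. The names `round_to_e4m3`, `mx_quantize`, and `mx_dot` are made up for illustration. The block size of 32 and the power-of-two shared scale follow the usual description of the MX format, the element rounding ignores subnormals and NaN, and Python's float64 stands in for the FP32 accumulator.

```python
import math
import random

BLOCK = 32         # elements that share one scale (assumed MX block size)
E4M3_MAX = 448.0   # largest finite E4M3 magnitude
E4M3_EMAX = 8      # 448 = 1.75 * 2**8

def round_to_e4m3(x):
    """Round x to 4 significant bits (1 implicit + 3 mantissa) and clamp.

    Simplified: subnormals and NaN handling are ignored.
    """
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16.0) / 16.0      # Python's round() is round-half-to-even
    return max(-E4M3_MAX, min(E4M3_MAX, math.ldexp(m, e)))

def mx_quantize(block):
    """Return (elements, scale) for one block, with a power-of-two scale."""
    amax = max(abs(v) for v in block)
    exp = math.floor(math.log2(amax)) - E4M3_EMAX if amax > 0 else 0
    scale = math.ldexp(1.0, exp)    # 2**exp, exactly representable
    return [round_to_e4m3(v / scale) for v in block], scale

def mx_dot(a, b):
    """Dot product of block-quantized inputs with a wide accumulator."""
    total = 0.0                     # float64 here, standing in for FP32
    for i in range(0, len(a), BLOCK):
        qa, sa = mx_quantize(a[i:i + BLOCK])
        qb, sb = mx_quantize(b[i:i + BLOCK])
        partial = sum(x * y for x, y in zip(qa, qb))  # low-precision operands
        total += partial * sa * sb  # apply both block scales once per block
    return total

random.seed(0)
a = [random.uniform(0.5, 1.5) for _ in range(4096)]
b = [random.uniform(0.5, 1.5) for _ in range(4096)]
exact = sum(x * y for x, y in zip(a, b))
print("relative error:", abs(mx_dot(a, b) - exact) / exact)
```

Each element only keeps 3 mantissa bits, but the rounding errors mostly cancel over a long sum as long as the accumulator itself is wide. Narrowing the accumulator removes that cushion, which is why the -> FP32 part matters.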