by bytepoet
361 days ago
This is very cool. I enjoyed going through the writeup and GitHub README. I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication. I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across a paper that does exactly that: FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!