by espadrine 845 days ago
Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be implemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.