by espadrine
845 days ago
Another factor is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be implemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
https://www.reddit.com/r/MachineLearning/comments/1amb3xu/d_...