by thatguysaguy 846 days ago
Too new is definitely one thing. Someone is going to have to gamble on actually paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.

There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means it's not just a no-brainer to switch over.
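
To make the recall point concrete, here is a toy sketch of the intuition, not anything taken from the linked paper. It assumes a cartoon "attention" that can look back over the whole context, and a hypothetical FixedStateMemory whose state does not grow with context length. Neither is a real transformer or a real SSM like Mamba. The overwrite-on-collision rule and all names are made up for illustration.

```python
# Toy illustration only: neither piece is a real transformer or a real SSM.

def attention_style_recall(context, query_key):
    """Look back over the whole stored context, as a growing KV cache allows."""
    for key, value in reversed(context):
        if key == query_key:
            return value
    return None


class FixedStateMemory:
    """Stand-in for a model whose state size is independent of context length."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def write(self, key, value):
        # Hypothetical compression rule: colliding keys overwrite each other.
        self.slots[key % len(self.slots)] = (key, value)

    def recall(self, query_key):
        entry = self.slots[query_key % len(self.slots)]
        if entry is not None and entry[0] == query_key:
            return entry[1]
        return None


def recall_accuracy(num_pairs, num_slots):
    """Return (attention-style accuracy, fixed-state accuracy) on key recall."""
    context = [(k, k * 10 + 7) for k in range(num_pairs)]
    memory = FixedStateMemory(num_slots)
    for key, value in context:
        memory.write(key, value)
    attn_hits = sum(attention_style_recall(context, k) == v for k, v in context)
    fixed_hits = sum(memory.recall(k) == v for k, v in context)
    return attn_hits / num_pairs, fixed_hits / num_pairs


if __name__ == "__main__":
    print(recall_accuracy(4, 4))  # context fits in the fixed state
    print(recall_accuracy(8, 4))  # context exceeds the fixed state
```

Once there are more key-value pairs than slots, something has to be dropped, whatever the compression rule is. A real SSM compresses far more cleverly than this, but the pigeonhole argument is the same.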

3 comments

Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be implemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
It's a reasonably easy bet that Together is doing or will do a serious pretraining run with Mamba. If that's a success, other players might start considering it more.
> There are some papers suggesting that transformers are better than SSMs in fundamental ways

I mean, vanilla transformers are also shown failing at the tasks those papers present.