| HN Mirror



	by thatguysaguy 846 days ago
	Being too new is definitely one factor. Someone is going to have to take a gamble and actually pay for a serious pretraining run with this architecture before we know how it really stacks up against transformers. There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g., SSMs cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over is not a no-brainer.

3 comments

espadrine 845 days ago

Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be implemented in DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.


cs702 845 days ago

Not necessarily:

https://www.reddit.com/r/MachineLearning/comments/1amb3xu/d_...


gaogao 846 days ago

It's a reasonably easy bet that Together is doing or will do a serious pretraining run with Mamba, and if that's a success, other players might start considering it more.


whimsicalism 845 days ago

> There are some papers suggesting that transformers are better than SSMs in fundamental ways

I mean, vanilla transformers are also shown failing at the tasks those papers present.
