Hacker News new | ask | show | jobs
by CuriouslyC 843 days ago
This. Hyperparameter tuning and training include a lot of model specific black magic. Transformers have had time to mature, it'll take a while for other stuff to catch up even if they have a higher potential ceiling.
2 comments

Definitely agree that a lot of work going into hyperparameter tuning and maturing the ecosystem will be key here!

I'm seeing the Mamba paper as the `Attention Is All You Need` of Mamba - it might take a little while before we get everything optimised to the point of a GPT-4 (it took 6 years for transformers but should be faster than that now with all the attention on ML)

Another interesting one is that the hardware isn't really optimised for Mamba yet either - ideally we'd want more of the fast SRAM so that we can store more larger hidden states efficiently