
by jxmorris12 845 days ago

In case people are wondering why Mamba is exciting:

There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
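For reference, one widely cited parametric form of such a law comes from the Chinchilla paper (Hoffmann et al., 2022), not from the Mamba paper or this comment. Here N is the parameter count, D is the number of training tokens, E is an irreducible loss term, and A, B, α, β are constants fitted to training runs:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Read this way, a "better" architecture is roughly one whose fitted curve sits lower at the same N and D.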

Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.

For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752

3 comments

KuriousCat 845 days ago

People have shown that even CNNs can match the performance of Transformers.

https://openreview.net/forum?id=TKIFuQHHECj#

I believe there is a lot of herding going on, driven by the influence of people who had compute to play around with rather than by deeply insightful or principled exploration of networks.


jdeaton 845 days ago

you linked a paper about vision transformers...


hervature 845 days ago

Being used as a comparison...

From the abstract:

> Bringing these components together, we are able to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers.


hansonw 845 days ago

“RNN-mode inference” is also extremely exciting because you can precompute the hidden state of any prompt prefix (i.e. a long system prompt, or statically retrieved context) and continued generations pay the same cost irrespective of the prefix length.
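To make the prefix-caching point concrete, here is a minimal sketch using a toy diagonal linear recurrence. It is not Mamba itself: the real model makes its parameters input-dependent and keeps additional per-layer state, and the names `ssm_scan`, `A`, `B`, `C`, and `d_state` are all hypothetical, chosen only for illustration. The sketch shows that continuing from a cached prefix state gives the same outputs as re-running the whole sequence, and that the cached state has a fixed size.

```python
import random

def ssm_scan(xs, h0, A, B, C):
    """Run a toy diagonal linear recurrence over scalar inputs xs.

    h_t = A * h_{t-1} + B * x_t   (elementwise; A, B, C are vectors)
    y_t = sum(C * h_t)
    Returns (outputs, final_state).
    """
    h = list(h0)
    ys = []
    for x_t in xs:
        h = [a * h_i + b * x_t for a, h_i, b in zip(A, h, B)]
        ys.append(sum(c * h_i for c, h_i in zip(C, h)))
    return ys, h

random.seed(0)
d_state = 16  # assumed state size, for illustration only
A = [random.uniform(0.5, 0.99) for _ in range(d_state)]
B = [random.gauss(0, 1) for _ in range(d_state)]
C = [random.gauss(0, 1) for _ in range(d_state)]

prefix = [random.gauss(0, 1) for _ in range(1000)]  # e.g. a long system prompt
suffix = [random.gauss(0, 1) for _ in range(5)]     # the new tokens

zero_state = [0.0] * d_state

# Pay the prefix cost once and keep only the final state.
_, prefix_state = ssm_scan(prefix, zero_state, A, B, C)

# Continue from the cached state: cost depends only on len(suffix).
y_cached, _ = ssm_scan(suffix, prefix_state, A, B, C)

# Reference: recompute everything from scratch.
y_full, _ = ssm_scan(prefix + suffix, zero_state, A, B, C)

outputs_match = y_cached == y_full[-len(suffix):]
```

The cached `prefix_state` holds `d_state` numbers whether the prefix is 10 tokens or 10,000, which is the source of both the constant continuation cost and the fixed memory capacity.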


shikon7 845 days ago

But this also means that the amount of information that can be retained is constant irrespective of the prefix length. This might be a problem if the prefix is composed of essentially incompressible data.


hansonw 845 days ago

Indeed: https://arxiv.org/pdf/2402.01032.pdf Perhaps future iterations of SSMs will accommodate dynamically sized (but still non-linearly-growing) hidden states / memories!


5kg 845 days ago

I'd love to see someone who has the resources train a model bigger than 2.8b and show the scaling law still holds.


nickpsecurity 845 days ago

Some prior comments said those architectures lack the memory, or something like it, of a transformer, and that this weakness is what keeps people using transformers. If true, I’d also like to see tests across various domains with equivalent transformer and Mamba designs to see whether that difference affects anything. From there, we’d have a better idea about whether Mamba-176B is worth the money.
