by intalentive 845 days ago
Nice post. A couple things to add:

1. Mamba co-author Tri Dao was also the lead author of FlashAttention.

2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
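To make "in matrix form" concrete, here is a rough NumPy sketch of the HiPPO-LegS (scaled Legendre) state matrix as I remember it from the HiPPO paper. The function name `hippo_legs` is made up for illustration and this is not code from the Mamba or S4 repos, so treat it as a sketch and check it against the paper:

```python
import numpy as np

def hippo_legs(N):
    """Sketch of the HiPPO-LegS (A, B) pair for state size N.

    A[n, k] = -sqrt(2n+1) * sqrt(2k+1)  if n > k
              -(n + 1)                  if n == k
               0                        if n < k
    B[n]    =  sqrt(2n+1)
    """
    n = np.arange(N)
    scale = np.sqrt(2 * n + 1)
    # Strictly lower triangle from the outer product, then the diagonal term.
    A = -np.tril(np.outer(scale, scale), k=-1) - np.diag(n + 1)
    B = scale.copy()
    return A, B

A, B = hippo_legs(4)
```

A is lower triangular with diagonal -(n+1), so every eigenvalue is negative and the continuous-time system x'(t) = A x(t) + B u(t) is stable. That structure is what you give up with a random init.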

Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.

1 comment

The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
Nonetheless, this is extremely exciting, unlike RWKV and Retentive Network (RetNet).
Why? From what I've read, those architectures have many similarities (and the same weaknesses).