Hacker News
by fermuch 767 days ago
Would something like this apply to Mamba/Jamba too?
1 comments

I think any next-token predictor will benefit. IIUC, Mamba is a next-token predictor.

I just skimmed the gradient article, but if their only change is swapping out the transformer block for the Mamba block, I don't think it's already using this optimization.