	by korbip 768 days ago
	Disclaimer: I'm a shared first author of this paper. As a clarification: The speed for training will be on par with FlashAttention-2, when fully optimized and only including the mLSTM. For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is, state-tracking capabilities, for problems Transformers and State Space Models (and any other sequence-parallelizable architecture) cannot solve fundamentally.

8 comments

brookst 768 days ago

Congrats on the paper, very interesting.

Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2], will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it’s hard to compete on general purpose hardware?

1. https://www.etched.com/

2. https://www.embedded.com/ai-chip-features-hardware-support-f...

deepnet 768 days ago

Fascinating work, very promising.

Can you summarise how the model in your paper differs from this implementation of xLSTM?

https://github.com/huggingface/transformers/issues/27011

korbip 763 days ago

Thanks! I don't see any implementation there. In any case, we are planning a code release soon.

WithinReason 768 days ago

Can you expand on the "cannot solve fundamentally" part?

lucidrains 767 days ago

https://arxiv.org/abs/2404.08819

Der_Einzige 767 days ago

So does anything do proper state tracking? And don’t point to the OP, since very often purportedly better new architectures end up being basically vaporware (like Mamba or RWKV, which still don’t have good-quality pretrained models yet).

impossiblefork 767 days ago

How do you mean vaporware?

Surely whether a big model using a certain system exists is only a matter of the choices of those with sufficient resources to train it. That's only a matter of their beliefs, not about actual model performance.

thomasahle 767 days ago

Transformers and SSMs can't do long computations that are inherently sequential.

Unless you give them chain of thought. In which case they do great.
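As a toy illustration of what an "inherently sequential" state-tracking computation can look like, here is a minimal sketch that tracks the running composition of a sequence of permutations. The example and the helper names `compose` and `track_state` are assumptions made for illustration; they are not taken from the xLSTM paper or the linked arXiv paper. The point is only that each step depends on the full result of the previous step, and that the order of the inputs matters.

```python
def compose(p, q):
    """Return the permutation 'apply p, then q' (permutations as tuples)."""
    return tuple(q[i] for i in p)


def track_state(perms, n):
    """Fold a sequence of permutations of n items into one running state.

    The loop is the sequential part: step t needs the state from step t-1.
    """
    state = tuple(range(n))  # identity permutation
    for p in perms:
        state = compose(state, p)
    return state


swap01 = (1, 0, 2)  # swap the first two items
cycle = (1, 2, 0)   # rotate all three items

print(track_state([swap01, cycle], 3))   # (2, 1, 0)
print(track_state([cycle, swap01], 3))   # (0, 2, 1)
```

The two calls give different results because permutation composition is not commutative, so the state cannot be recovered from, say, counts of each input symbol. A recurrent model can in principle carry such a state forward one token at a time, while writing out intermediate states token by token is roughly what chain of thought adds.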

albertzeyer 768 days ago

Congratulations on the paper. That's some very interesting work!

But you would want to include sLSTM as well to get the best performance, right? How does the speed compare in that case? Specifically when scaling up.

korbip 768 days ago

Thank you! I can say that it is not really a diminishing factor at the scales reported in the paper. So, xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed. We show that it is helpful on toy tasks, and it shows even better sequence extrapolation performance, so yes.

goldemerald 767 days ago

Great work! I'd love to start using the language model variant of your work. Do you know when/if it will be open sourced? I'd start using it today if it were that soon.

SpaceManNabs 767 days ago

> For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture

Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or that the xLSTM architecture slows it down so that its inference is as slow as Mamba's?

hh1 767 days ago

When you talk about "c" or "scalar memory" in the paper, does that refer to a single unit in the vector usually referred to as c?

So in mLSTM, each unit of the vector c is now a matrix (so a 3d tensor)? And we refer to each matrix as a head?

Having a bit of issue understanding this fundamental part

korbip 763 days ago

You mainly got it right. Usually one does have many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, with cells talking only to other cells within the same head. The reason that we referred to scalar cells here is that these are the fundamental building block. Many of them can be, and usually are, combined, and vector notation is useful in this case.

For the matrix 'C' state, there are also heads/cells in that sense that you have multiple, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block / concept.
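The shapes described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the sizes `num_heads` and `head_dim`, the function names, and the random inputs are all made up here, the gates are passed in as plain per-head scalars, and the normalizer state, gate stabilization, and output gate are left out.

```python
import numpy as np

num_heads, head_dim = 4, 8  # hypothetical sizes, chosen for illustration
rng = np.random.default_rng(0)

# sLSTM-style state: many scalar cells, stored as a vector per head.
c_scalar = np.zeros((num_heads, head_dim))

# Memory mixing within a head: one recurrent matrix per head, so a cell
# only sees the previous hidden values of cells in the same head.
R = 0.1 * rng.normal(size=(num_heads, head_dim, head_dim))


def slstm_mix(R, h_prev):
    """Per-head recurrent contribution; h_prev has shape (num_heads, head_dim)."""
    return np.einsum("hij,hj->hi", R, h_prev)


# mLSTM-style state: one matrix per head, i.e. a 3D tensor overall.
C = np.zeros((num_heads, head_dim, head_dim))


def mlstm_step(C, k, v, i_gate, f_gate):
    """Simplified matrix-memory update: C <- f * C + i * (v k^T), per head.

    k, v: (num_heads, head_dim); i_gate, f_gate: (num_heads,).
    Heads are updated independently; nothing mixes across heads.
    """
    outer = np.einsum("hi,hj->hij", v, k)
    return f_gate[:, None, None] * C + i_gate[:, None, None] * outer


def mlstm_read(C, q):
    """Query the matrix memory per head: returns C @ q for each head."""
    return np.einsum("hij,hj->hi", C, q)


k = rng.normal(size=(num_heads, head_dim))
v = rng.normal(size=(num_heads, head_dim))
q = rng.normal(size=(num_heads, head_dim))
C = mlstm_step(C, k, v, np.ones(num_heads), np.ones(num_heads))
print(c_scalar.shape, C.shape, mlstm_read(C, q).shape)
```

In this sketch the state has the same shape after every step, regardless of how many tokens have been processed, which is the usual sense in which a recurrent model decodes with constant memory per step.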

logicchains 768 days ago

To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.

korbip 768 days ago

For language in general it seems fine. But there might be specific tasks where it is necessary indeed.