| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by billconan 600 days ago
	> We show that SSMs with local self-attention, a form of input-dependent input processing, can perform in-context learning analogously to transformers, i.e. through gradient descent steps on an implicit linear regression problem. I don't understand. The benefit of SSMs is better scalability than self-attention. Now this adds self-attention back?

1 comments

pitpatagain 598 days ago

It adds a very local sliding window attention, the context is only 3 adjacent frames per step. They need the access to adjacent frames to show the implicit model gradient computation but I didn't yet follow the derivation for why this is so.

link