by billconan
600 days ago
> We show that SSMs with local self-attention, a form of input-dependent input processing, can perform in-context learning analogously to transformers, i.e. through gradient descent steps on an implicit linear regression problem.

I don't understand. The benefit of SSMs is better scalability than self-attention. Now this adds self-attention back?