|
|
|
|
|
by oatsandsugar
612 days ago
|
|
On the page it states: Our novel shared-attention architecture allows more parameters to be allocated to the Mamba2 backbone. In turn, the shared transformer block preserves the rich cross-sequence dependencies of the attention computation. so sounds like it is transformer based? |
|