by aesthesia
34 days ago
The Nemotron model has attention layers interspersed with the Mamba layers, but at first I didn't see any attention layers in the model. It looks like they are present but show up as blocks with an RMSNorm followed by two sequential linear layers. The first few resolution levels aren't very useful either.