|
|
|
|
|
by valine
811 days ago
|
|
The architecture changes are very straight forward. Model merging has shown that pre-trained transformer layers are very robust. I’ll bet it’s possible to fine tune a pre-trained model like mistral to use this architecture. That would enable someone to test it with more parameters without training a whole new base model. |
|