by israrkhan
845 days ago
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 already handles up to 1 million tokens of context. I have not seen any large-scale Mamba model, so I am not aware of its shortcomings, but I am sure there are tradeoffs. It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?
https://github.com/jzhang38/LongMamba