by israrkhan 845 days ago
MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 is already handling up to 1 million tokens. I have not seen any large-scale Mamba model, so I am not aware of its shortcomings, but I am sure there are tradeoffs.

It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?

3 comments

Interesting. This is exactly what I was thinking about. Thanks for sharing.
MoE lets you scale up model size for a given compute budget. That hopefully leads to more intelligent models. It is, however, independent of context size: the ability to process a lot of tokens / text.
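To make that concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a standard transformer with full self-attention and a top-k routed MoE feed-forward layer; the function names and all sizes (d_model, d_ff, expert counts) are made up for illustration, and router weights and biases are ignored. It is not a description of how Gemini or any particular model is built.

```python
# Illustrative arithmetic only; every size below is a hypothetical value.

def moe_ffn_params(d_model: int, d_ff: int, num_experts: int) -> int:
    """Total weights in one MoE feed-forward layer (two matrices per expert)."""
    return num_experts * 2 * d_model * d_ff


def moe_ffn_active_params(d_model: int, d_ff: int, top_k: int) -> int:
    """Weights used for a single token when the router selects top_k experts."""
    return top_k * 2 * d_model * d_ff


def attention_score_entries(seq_len: int) -> int:
    """Entries in the full attention score matrix for one head (quadratic in seq_len)."""
    return seq_len * seq_len


d_model, d_ff, top_k = 1024, 4096, 2

# Adding experts grows total parameters, but per-token work stays the same.
for num_experts in (8, 64):
    total = moe_ffn_params(d_model, d_ff, num_experts)
    active = moe_ffn_active_params(d_model, d_ff, top_k)
    print(f"experts={num_experts} total={total:,} active_per_token={active:,}")

# Attention cost depends only on sequence length; the expert count never appears.
for seq_len in (1_000, 1_000_000):
    print(f"seq_len={seq_len:,} attention_entries={attention_score_entries(seq_len):,}")
```

Going from 8 to 64 experts multiplies the stored weights by 8 while the per-token active weights do not change, and the attention term grows with the square of the sequence length no matter how many experts there are. That is the sense in which MoE and context length are separate axes.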
nope :) MoE does not scale transformers along sequence length