| HN Mirror



	by israrkhan 845 days ago
	MoE (Mixture of Experts) is an effective way to scale transformers. Gemini 1.5 already handles up to 1 million tokens. I have not seen any large-scale Mamba model, so I am not aware of its shortcomings, but I am sure there are tradeoffs. It should be possible to combine Mamba with MoE. I wonder what that would look like... a billion-token context?

3 comments

intalentive 845 days ago

https://arxiv.org/abs/2401.04081

https://github.com/jzhang38/LongMamba


israrkhan 845 days ago

Interesting. This is exactly what I was thinking about. Thanks for sharing.


nestorD 845 days ago

MoE lets you scale model size up with compute. That hopefully leads to more intelligent models. It is, however, independent of context size: the ability to process a lot of tokens / text.
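
To make that concrete, here is a rough back-of-envelope sketch of one transformer block. The function name `layer_costs`, the parameters `num_experts`, `top_k`, and `ffn_mult`, and the simplified cost formulas (constant factors, biases, gating, and the Q/K/V projections are all ignored) are my own assumptions for illustration, not taken from any particular model or paper.

```python
def layer_costs(seq_len, d_model, num_experts=1, top_k=1, ffn_mult=4):
    """Rough per-block counts for a transformer layer (illustrative only).

    num_experts=1, top_k=1 corresponds to an ordinary dense feed-forward layer.
    """
    d_ff = ffn_mult * d_model

    # Self-attention mixing (QK^T and AV): every token interacts with every
    # other token, so this term grows with seq_len squared.
    attn_macs = 2 * seq_len * seq_len * d_model

    # Feed-forward parameters: each expert owns its own two projection matrices.
    ffn_params = num_experts * 2 * d_model * d_ff

    # Feed-forward compute: each token is routed through only top_k experts.
    ffn_macs = seq_len * top_k * 2 * d_model * d_ff

    return {"attn_macs": attn_macs, "ffn_params": ffn_params, "ffn_macs": ffn_macs}


dense = layer_costs(seq_len=4096, d_model=1024)
moe = layer_costs(seq_len=4096, d_model=1024, num_experts=8, top_k=1)
moe_long = layer_costs(seq_len=8192, d_model=1024, num_experts=8, top_k=1)

for name, costs in [("dense", dense), ("moe", moe), ("moe, 2x context", moe_long)]:
    print(name, costs)
```

Under these assumptions, going from dense to 8 experts with top-1 routing multiplies the feed-forward parameter count by 8 while leaving both the per-token feed-forward compute and the attention term unchanged. Doubling the sequence length still quadruples the attention term, with or without experts.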


whimsicalism 845 days ago

Nope :) MoE does not scale transformers along sequence length.
