by torginus
99 days ago
My understanding is that for a top-K MoE architecture, total model size doesn't really matter: you can have 10 32GB experts or a thousand, and if only 2-3 of them are active at the same time, your inference workload will be identical; only your hard drive traffic will increase. That seems to be the case, seeing how hungry the industry has been for hard drives lately.
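To make the arithmetic concrete, here is a rough sketch of what I mean. The 32GB expert size and K=2 come from the numbers above; the function names and the random "router scores" are made up for illustration, and this ignores the shared (non-expert) layers and the router itself, whose cost does grow a little with the expert count.

```python
import random

EXPERT_GB = 32  # size of one expert, taken from the comment above


def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def per_token_footprint(num_experts, k, expert_gb=EXPERT_GB):
    """Return (GB of expert weights used for one token, GB stored in total)."""
    active_gb = k * expert_gb
    total_gb = num_experts * expert_gb
    return active_gb, total_gb


if __name__ == "__main__":
    random.seed(0)
    for num_experts in (10, 1000):
        # Stand-in for router logits; a real router is a small learned layer.
        scores = [random.random() for _ in range(num_experts)]
        chosen = top_k_route(scores, k=2)
        active_gb, total_gb = per_token_footprint(num_experts, k=2)
        print(f"{num_experts} experts: {len(chosen)} active, "
              f"{active_gb} GB used per token, {total_gb} GB stored")
```

The per-token figure stays at K times the expert size no matter how many experts exist; only the stored total scales with the expert count.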