by EnPissant
27 days ago
This is just wishful thinking. For prefill, it's really easy to batch MoE and get really good tok/s, even on a single stream. For decode, you run into two problems: 1) you need more parallel requests, which means more memory for context, and 2) 5 parallel requests will not give you much expert overlap.
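To put a rough number on the expert-overlap point, here is a back-of-the-envelope sketch. It assumes each token picks its top-k experts uniformly at random and independently of the other requests, which real routers do not do exactly. The 128-expert, top-8 configuration is a hypothetical example, not a specific model, and the function names are made up for illustration.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch: int) -> float:
    """Expected number of distinct experts touched in one layer by `batch` tokens.

    Assumption: each token selects `top_k` of `num_experts` uniformly at random,
    independently of the other tokens. Under that assumption a given expert is
    missed by one token with probability 1 - top_k / num_experts.
    """
    miss_one_token = 1.0 - top_k / num_experts
    return num_experts * (1.0 - miss_one_token ** batch)


def expert_loads_per_token(num_experts: int, top_k: int, batch: int) -> float:
    """Distinct experts read from memory per decoded token, per layer."""
    return expected_distinct_experts(num_experts, top_k, batch) / batch


if __name__ == "__main__":
    # Hypothetical MoE shape: 128 experts, 8 active per token.
    for batch in (1, 5, 32, 256):
        distinct = expected_distinct_experts(128, 8, batch)
        per_token = expert_loads_per_token(128, 8, batch)
        print(f"batch={batch:4d}  distinct={distinct:6.1f}  per_token={per_token:5.2f}")
```

Under these assumptions, 5 parallel requests touch about 35 distinct experts per layer, barely fewer than the 40 you would load with no sharing at all, so each decoded token still costs roughly 7 expert reads instead of 8. The per-token cost only drops substantially at much larger batch sizes, which is where the extra context memory comes in.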