by EnPissant 27 days ago
This is just wishful thinking.

For prefill, it's easy to batch MoE and get good tok/s, even on a single stream.

For decode, you will run into the problem that:

1) you need more parallel requests, which means more memory for context

2) 5 parallel requests will not give you much expert overlap
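
To put a rough number on point 1, here is a back-of-the-envelope sketch. It assumes a plain GQA-style KV cache with no compression, and the model dimensions are hypothetical placeholders rather than any specific model's; architectures that compress the cache would come in lower.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, streams, bytes_per_elem=2):
    """Approximate KV-cache size for `streams` parallel requests.

    Assumes an uncompressed cache: one K and one V vector per layer,
    per KV head, per token. All parameters are illustrative.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * context_len * streams


# Hypothetical model: 60 layers, 8 KV heads, head_dim 128, fp16 cache, 32k context
one = kv_cache_bytes(60, 8, 128, 32768, streams=1)
five = kv_cache_bytes(60, 8, 128, 32768, streams=5)
print(f"1 stream:  {one / 2**30:.1f} GiB")
print(f"5 streams: {five / 2**30:.1f} GiB")
```

Under these assumptions the cache grows linearly with the number of parallel requests, so 5 streams cost 5x the context memory of one.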


You don't need "very much" expert overlap to see aggregate gains at scale, you just need some of it; that's where the "birthday" framing becomes relevant. Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

> You don't need "very much" expert overlap to see aggregate gains at scale, you just need some of it

I'm not sure what you are claiming. Decode is bottlenecked by memory bandwidth. To see a speedup of 2x, you have to ensure each expert weight memory fetch can be used by 2 parallel streams. What exactly is the average factor you are claiming for 5 parallel streams (due to "birthday paradox" factors)? The birthday paradox isn't really relevant here. It's about coverage, not parallelism.

> Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

This is not true.

An aggregate speedup of 2x is a lot; we don't need that in a local context. Local hardware is heavily constrained by power and thermals, not just bandwidth, so all we really care about is raising compute intensity for decode a little bit to relax the memory bandwidth constraint. The average factor will depend on just how sparse the model is and how far you can push parallelism; there isn't one single answer.

But you won't see 2x expert re-use; the speedup with 5 streams will be tiny.
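
For anyone who wants to plug in their own numbers, here is a minimal sketch of the "average factor" being argued about. It assumes routing is uniform and independent across streams, which real routers are not (correlated prompts would overlap more), and it only counts routed experts, not attention or shared weights that every stream reuses anyway. The expert counts are hypothetical.

```python
def expected_distinct_experts(num_experts, top_k, streams):
    """Expected number of distinct routed experts touched in one layer
    when each stream independently picks top_k of num_experts uniformly."""
    p_untouched = (1 - top_k / num_experts) ** streams
    return num_experts * (1 - p_untouched)


def reuse_factor(num_experts, top_k, streams):
    """Expert activations per expert weight fetch (ratio of expectations).
    1.0 means no sharing; 2.0 means each fetch serves two streams on average."""
    return streams * top_k / expected_distinct_experts(num_experts, top_k, streams)


# Hypothetical sparse MoE: 256 routed experts, top-8 routing
for streams in (1, 5, 32, 128):
    print(streams, round(reuse_factor(256, 8, streams), 2))
```

With 256 experts and top-8 routing, this model gives a reuse factor of roughly 1.06 at 5 streams, and it only approaches 2x at several dozen streams. Less sparse models (say, 8 experts with top-2) share much more at the same stream count, which is the "depends on sparsity" point.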