by EnPissant
18 days ago
> You don't need "very much" expert overlap to see aggregate gains at scale, you just need some of it

I'm not sure what you are claiming. Decode is bottlenecked by memory bandwidth. To see a speedup of 2x, you have to ensure each expert weight memory fetch can be used by 2 parallel streams. What exactly is the average factor you are claiming for 5x parallel streams (due to "birthday paradox" factors)? The birthday paradox isn't really relevant here. It's about coverage, not parallelism.

> Memory for context is an issue, but recent models like DeepSeek V4 use very little of it even at relatively large contexts.

This is not true.
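
To put a rough number on the reuse-factor question about 5 parallel streams, here is a back-of-the-envelope sketch. It assumes each stream routes to `top_k` experts chosen uniformly and independently out of `num_experts` per layer, which real routers do not do exactly, and the 256/8 configuration is a hypothetical example rather than any particular model's. The function name and parameters are made up for illustration.

```python
def expected_reuse_factor(num_experts: int, top_k: int, streams: int) -> float:
    """Average number of streams served per expert weight fetch.

    Assumption: every stream picks top_k distinct experts uniformly at
    random, independently of the other streams (not how trained routers
    behave, just a baseline).
    """
    # Probability that a given expert is selected by at least one stream.
    p_touched = 1.0 - (1.0 - top_k / num_experts) ** streams
    # Expected number of distinct experts whose weights must be fetched.
    distinct_experts = num_experts * p_touched
    # Total expert activations across all streams.
    total_activations = streams * top_k
    return total_activations / distinct_experts


if __name__ == "__main__":
    # Hypothetical MoE layer: 256 experts, 8 active per token, 5 streams.
    print(expected_reuse_factor(num_experts=256, top_k=8, streams=5))
```

Under those assumptions the factor for 5 streams comes out to about 1.06, i.e. nearly every expert fetch still serves a single stream, nowhere near the 2 streams per fetch needed for a 2x speedup. Correlated routing across streams would raise it, but that has to be demonstrated rather than inferred from coverage alone.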