by Kubuxu
502 days ago
Not sure about DeepSeek R1, but you are right with regard to previous MoE architectures. It doesn’t reduce memory usage, since each subsequent token might require a different expert, but it does reduce per-token compute and bandwidth usage.
If you place experts on different GPUs and run batched inference, you would see these benefits.
I could be very wrong about how experts work across layers, though; I have only done a naive reading on it so far.
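
To make the memory vs. compute point concrete, here is a rough toy sketch, not how any real model (DeepSeek R1 included) is implemented. It assumes a single MoE layer with top-k routing and uses random router scores; `route_top_k`, `moe_cost`, and all the numbers are made up for illustration.

```python
import random


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def moe_cost(num_experts, top_k, params_per_expert, num_tokens, seed=0):
    """Toy accounting for one MoE layer (hypothetical sizes, random routing)."""
    rng = random.Random(seed)

    # Memory: every expert has to stay loaded, because we cannot know
    # in advance which experts the next token will be routed to.
    resident_params = num_experts * params_per_expert

    # Compute/bandwidth: each token only reads and multiplies through
    # the weights of its top_k experts.
    active_params_per_token = top_k * params_per_expert

    # Track how many distinct experts get used across the whole sequence.
    touched = set()
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(num_experts)]  # stand-in for a router
        touched.update(route_top_k(scores, top_k))

    return resident_params, active_params_per_token, len(touched)


resident, active, touched = moe_cost(
    num_experts=8, top_k=2, params_per_expert=1_000_000, num_tokens=64
)
print("resident params:", resident)
print("active params per token:", active)
print("distinct experts used over the sequence:", touched)
```

Each token only touches `top_k` experts' worth of weights, but across a sequence the set of experts used keeps growing, which is why the full set has to stay in memory. With experts spread over several GPUs and a batch of tokens in flight, each GPU would only handle the tokens routed to its own experts.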