by liuliu
505 days ago
|
|
It does. They have 256 routed experts per MoE layer, plus a shared expert. The minimum deployment they recommend for decoding (a.k.a. token generation) is 320 GPUs (H800). It is all in the DeepSeek-V3 paper, which everyone should read rather than speculate.
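
For anyone who wants to sanity-check the 320 figure, here is a rough back-of-the-envelope sketch. It assumes 8 GPUs per H800 node and one of the 256 experts per GPU during decoding, which is my reading of the paper's deployment section, so verify against the paper. The function name and dictionary keys are made up for illustration.

```python
# Back-of-the-envelope check of the 320-GPU decoding deployment.
# Assumptions (not stated in the comment above): 8 GPUs per H800 node,
# and one of the 256 experts placed on each GPU.

EXPERTS_PER_LAYER = 256   # from the comment above
TOTAL_GPUS = 320          # from the comment above
GPUS_PER_NODE = 8         # assumption: standard 8-GPU H800 node


def decoding_layout(total_gpus=TOTAL_GPUS,
                    experts=EXPERTS_PER_LAYER,
                    gpus_per_node=GPUS_PER_NODE):
    """Split a deployment into nodes, expert GPUs, and leftover GPUs."""
    if total_gpus % gpus_per_node != 0:
        raise ValueError("total_gpus must be a multiple of gpus_per_node")
    if experts > total_gpus:
        raise ValueError("not enough GPUs for one expert per GPU")
    return {
        "nodes": total_gpus // gpus_per_node,
        "gpus_hosting_one_expert": experts,
        # Leftover GPUs would be available for shared or duplicated experts.
        "gpus_left_over": total_gpus - experts,
    }


layout = decoding_layout()
print(layout)
```

Under those assumptions the numbers come out to 40 nodes, with 64 GPUs left over beyond the 256 that each hold one expert.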