Hacker News new | ask | show | jobs
by danielhanchen 500 days ago
There are a few ways - the most basic is per layer sharding - DeepSeek uses 3 dense layers, so that can stay on GPU0 (with the embedding layer). There's 58 MoE layers (256 experts, 8 activated) and 1 shared expert per layer. GPU1 would house layers 3 to 9, and so on.

Then by using pipeline parallelism, if a new request comes, we simply stick them in a queue - GPUs 0, 1, 2, ..., 8. Request A is at GPU 2, Request B at GPU 1, Request C at GPU 0 and so on.

The other option is tensor parallelism were we split the weights evenly. You could combine pipeline and tensor parallelism as well!