|
|
|
|
|
by liuliu
501 days ago
|
|
Yeah, if you look DeepSeek v3 paper deeper, each saving on each axis is understandable. Combined, they reach some magic number people can talk about (10x!): FP8: ~1.6 to 2x faster than BF16 / FP16; MLA: cut KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2 to 1.5x faster. If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is to improve training convergence, and DualPipe is to overlapping communication / compute mostly for training purpose too). The efficiency improvement on inference IMHO is overblown. |
|