| vLLM has supported MLA for DeepSeek models since v0.7.1, released about 3 weeks ago, with a reported 3x higher generation throughput and 10x token memory capacity. https://github.com/vllm-project/vllm/releases/tag/v0.7.1 Apparently MHA is still faster in the low-QPS regime. https://neuralmagic.com/blog/enhancing-deepseek-models-with-... Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models. https://arxiv.org/pdf/2502.07864 |
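To make the KV cache comparison concrete, here is a rough back-of-the-envelope sketch of how many elements each attention variant caches per token per layer. The function names are hypothetical and the dimensions (128 heads, head size 128, 8 KV heads for GQA, a 512-wide latent plus a 64-wide RoPE key for MLA) are illustrative assumptions, not numbers taken from the links above. It is not meant to reproduce the 3x / 10x figures, which depend on vLLM's actual implementation.

```python
# Rough per-token, per-layer KV cache sizes (in elements, not bytes).
# All dimensions below are illustrative assumptions.

def mha_cache_elems(n_heads: int, head_dim: int) -> int:
    # MHA caches one key and one value vector for every attention head.
    return 2 * n_heads * head_dim

def gqa_cache_elems(n_kv_heads: int, head_dim: int) -> int:
    # GQA shares keys/values across query groups, so only n_kv_heads are cached.
    return 2 * n_kv_heads * head_dim

def mla_cache_elems(latent_dim: int, rope_dim: int) -> int:
    # MLA caches a single compressed latent vector plus a small decoupled
    # RoPE key, shared by all heads.
    return latent_dim + rope_dim

mha = mha_cache_elems(n_heads=128, head_dim=128)
gqa = gqa_cache_elems(n_kv_heads=8, head_dim=128)
mla = mla_cache_elems(latent_dim=512, rope_dim=64)

print(f"MHA: {mha}  GQA: {gqa}  MLA: {mla}")
print(f"MHA/MLA: {mha / mla:.1f}x  GQA/MLA: {gqa / mla:.1f}x")
```

Under these assumed dimensions, MLA caches far fewer elements per token than MHA and somewhat fewer than an 8-KV-head GQA setup, which is why the same memory budget holds many more tokens.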
I am very curious to see how well-optimized DeepSeek's code is compared to leading LLM serving software like vLLM or SGLang.