| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by liuliu 551 days ago
	It does. They have 256 experts per MLP layer, and some shared ones. The minimal deployment for decoding (aka. token generation) they recommend is 320 GPUs (H800). It is all in the DeepSeek v3 paper that everyone should read rather than speculating.

1 comments

andrewgross 551 days ago

Got it. I’ll review the paper again for that portion. However, it still sounds like the end result is not VRAM savings but efficiently and speed improvements.

link

liuliu 551 days ago

Yeah, if you look DeepSeek v3 paper deeper, each saving on each axis is understandable. Combined, they reach some magic number people can talk about (10x!): FP8: ~1.6 to 2x faster than BF16 / FP16; MLA: cut KV cache size by 4x (I think); MTP: converges 2x to 3x faster; DualPipe: maybe ~1.2 to 1.5x faster.

If you look deeper, many of these are only applicable to training (we already do FP8 for inference, MTP is to improve training convergence, and DualPipe is to overlapping communication / compute mostly for training purpose too). The efficiency improvement on inference IMHO is overblown.

link

rahimnathwani 550 days ago

  we already do FP8 for inference

Yes but, for a given size of model, Deepseek claims that a model trained with FP8 will work better than a model quantized to FP8. If that's true then, for a given quality, a native FP8 model will be smaller, and have cheaper inference.

link