That doesn't necessarily mean the final weights are 8-bit, though. Tensor core ops are usually mixed precision: the matmul inputs are low precision, but accumulation (i.e. the final result) is done in much higher precision to reduce error.
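A rough way to see why the accumulator precision matters, as a toy NumPy sketch rather than how tensor cores are actually implemented. Assumptions: FP16 stands in for the low-precision format (NumPy has no FP8 or BF16), and `dot_accumulate` is a made-up helper that does the dot product one element at a time.

```python
import numpy as np

def dot_accumulate(a, b, acc_dtype):
    """Dot product whose products and running sum are kept in acc_dtype."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        # Inputs are FP16; the product and the running sum use acc_dtype.
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return acc

n = 10_000
a = np.full(n, 0.01, dtype=np.float16)  # low-precision inputs
b = np.ones(n, dtype=np.float16)

fp16_acc = dot_accumulate(a, b, np.float16)  # accumulate in FP16
fp32_acc = dot_accumulate(a, b, np.float32)  # accumulate in FP32

print("FP16 accumulator:", float(fp16_acc))
print("FP32 accumulator:", float(fp32_acc))
```

Same FP16 inputs in both cases. The true sum is about 100, and the FP32 accumulator lands close to it. The FP16 accumulator stops growing well short of that, because once the running sum is large enough each new 0.01 is smaller than half the gap between adjacent FP16 values and gets rounded away.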

From DeepSeek-V3:

"For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators... To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision."
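The "master weights in higher precision" part can be illustrated the same way. This is only a sketch of the general idea, not DeepSeek's actual training recipe. Assumptions: FP16 again stands in for the low-precision format, the update is a fixed 1e-4 per step, and `train` is a hypothetical helper.

```python
import numpy as np

def train(steps, update, weight_dtype):
    """Apply the same small update repeatedly to one weight stored in weight_dtype."""
    w = weight_dtype(1.0)
    for _ in range(steps):
        w = weight_dtype(w - weight_dtype(update))
    return w

steps, update = 1000, 1e-4

naive_fp16 = train(steps, update, np.float16)   # weight itself kept in FP16
master_fp32 = train(steps, update, np.float32)  # FP32 master copy
compute_copy = np.float16(master_fp32)          # cast down only for the matmul

print("FP16-only weight:", float(naive_fp16))
print("FP32 master     :", float(master_fp32))
print("FP16 copy       :", float(compute_copy))
```

The FP16-only weight never moves off 1.0, since each update is smaller than half the spacing between FP16 values near 1. The FP32 master copy ends up near 0.9, and the low-precision copy used for compute is just a cast of that.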