Hacker News new | ask | show | jobs
by doctoboggan 508 days ago
Its not that aggressive of a quantization considering that the full model was trained at only 8 bits.
2 comments

That doesn't necessarily mean final weights are 8-bit though. Tensor core ops are usually mixed precision- matmul happens in low precision but accumulation (i.e. final result) is done in much higher precision to reduce error.

from deepseek v3:

"For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators...To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. "

That's 16x fewer possible values though (and also just 16 possible values full stop). It would be like giving every person on Earth the same shoe size.
The Henry Ford school of product