| HN Mirror

Basically no one uses FP32 at inference time. BF16/FP16 is typically considered unquantized, whereas FP8 is lightly quantized. That being said there's pretty minimal quality loss at FP8 compared to 16-bit typically; Llama 3.1 405b, for example, only benchmarks around ~1% worse when run at FP8: https://blog.vllm.ai/2024/07/23/llama31.html

Every major inference provider other than Hyperbolic Labs runs Llama 3.1 405b at FP8, FWIW (e.g. Together, Fireworks, Lepton), so to compare against FP32 is misleading to say the least. Even Hyperbolic runs it at BF16.

Pretraining is typically done in FP32, although some labs (e.g. Character AI, RIP) apparently train in INT8: https://research.character.ai/optimizing-inference/