by minimaxir
1086 days ago
I'm still confused by the proliferation of bf16. It certainly doesn't hurt compared to fp16, but in my testing, even on A100 GPUs that are optimized for it, both training speed and inference quality were the same between bf16 and fp16.

It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
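
For anyone wondering why the norm and clipping matter for fp16 at all, here is a rough stdlib-only sketch of the difference in range and precision between the two formats. The helper names `to_fp16` and `to_bf16` are made up for illustration, and the bf16 rounding is my own emulation (round-to-nearest-even on float32 bits, no NaN handling), not what any particular framework or GPU does internally.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range (max 65504);
        # treat that as overflow to infinity, as hardware would.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits, 7 mantissa bits) via float32 bits.

    Assumption: round-to-nearest-even on the 16 dropped bits; NaN and
    values outside the float32 range are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias
    bits &= 0xFFFF0000                    # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (70000.0, 1e-8, 1.001):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

Large activations (70000.0) overflow in fp16 but survive in bf16, and tiny gradients (1e-8) flush to zero in fp16. Values near 1 keep more digits in fp16. If normalization and clipping keep everything inside the fp16 range, the extra range of bf16 buys nothing, which would be consistent with seeing no difference in practice.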