by redox99
1086 days ago
Sometimes a network that converges in fp32 will explode to Infs or NaNs when trained in fp16, because of fp16's limited range (its largest finite value is 65504). bf16 generally fixes that, since it keeps the same 8 exponent bits as fp32 and gives up mantissa precision instead. It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
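To make the range point concrete, here is a rough stdlib-only sketch. The helper names `to_fp16` and `to_bf16` are made up for illustration, and the bf16 conversion simply truncates the low 16 bits of the fp32 encoding, whereas real hardware and frameworks usually round to nearest. It only shows the overflow behaviour, not a real training setup.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; map overflow to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range;
        # on a GPU the same cast would produce an infinity.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the fp32 encoding.

    Assumption: plain truncation, not round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# A value that is unremarkable in fp32 but past fp16's largest finite value.
activation = 1e5
print("fp16:", to_fp16(activation))  # overflows to inf
print("bf16:", to_bf16(activation))  # stays finite, with reduced precision
```

The bf16 result is only accurate to two or three significant digits, which is the trade: same range as fp32, much less precision.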
The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.