by redox99 1086 days ago
Sometimes during training, a network that would converge in fp32 will explode to Infs or NaNs in fp16 because of fp16's limited range. bf16 generally fixes that.
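
To make the range point concrete, here is a rough stdlib-only sketch. The helper names (`max_finite`, `fp16_round_trip`) and the sample activation value are made up for illustration. It derives the largest finite value of each format from its exponent and mantissa widths, then shows a product that overflows fp16 while staying far below the bf16 limit.

```python
import math
import struct

def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** bias

FP16_MAX = max_finite(5, 10)  # 5 exponent bits, 10 mantissa bits
BF16_MAX = max_finite(8, 7)   # 8 exponent bits (same as fp32), 7 mantissa bits
FP32_MAX = max_finite(8, 23)

def fp16_round_trip(x: float) -> float:
    """Round x to fp16 and back, mapping out-of-range values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the fp16 range;
        # fp16 arithmetic on hardware would produce inf here instead.
        return math.copysign(math.inf, x)

activation = 300.0  # hypothetical value, chosen only to trigger the overflow
print(FP16_MAX, BF16_MAX, FP32_MAX)
print(fp16_round_trip(activation))               # representable in fp16
print(fp16_round_trip(activation * activation))  # exceeds FP16_MAX, becomes inf
print(activation * activation < BF16_MAX)        # still well inside bf16 range
```

The tradeoff is that bf16 keeps only 7 mantissa bits against fp16's 10, so it gives up precision to get fp32's exponent range.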

It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
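
For reference, gradient clipping by global norm can be sketched in a few lines. This is a plain-Python illustration on a flat list of gradient values, not any framework's implementation. The function name, the `eps` constant, and the sample numbers are assumptions for the example.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is 1.0 when the norm is already within bounds, otherwise below 1.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]

# A gradient with norm 5 is scaled down to roughly norm 1.
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
# A gradient already under the limit is left unchanged.
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))
```

Bounding the update size this way keeps values further from the fp16 limit, but it does not remove the limit itself.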

1 comment

Yeah, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized entropy problems (explosion, not converging, converging more slowly) with fp16 than with bf16.

The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.
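
One plausible reason for the difference (an assumption, since the pipeline itself isn't described here) is that bf16 is essentially the top 16 bits of an fp32 value, so conversion is a rounding step on the bit pattern, while fp16 has a narrower exponent and needs overflow/underflow handling. Here is a rough stdlib-only sketch, with hypothetical helper names and sample values:

```python
import math
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """16-bit bf16 pattern for x (assumed to fit in fp32), round-to-nearest-even."""
    if math.isnan(x):
        return 0x7FC0  # a quiet NaN pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round on the 16 bits being dropped
    return (bits >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Expand a bf16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def to_bf16(x: float) -> float:
    return bf16_bits_to_float(fp32_to_bf16_bits(x))

def to_fp16(x: float) -> float:
    """Round x to fp16 and back, mapping out-of-range values to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def count_damaged(values, convert):
    """Count finite nonzero inputs that turn into inf or zero after conversion."""
    damaged = 0
    for v in values:
        out = convert(v)
        if v != 0.0 and math.isfinite(v) and (math.isinf(out) or out == 0.0):
            damaged += 1
    return damaged

values = [1e-9, 3.5e-4, 0.25, 1200.0, 7.0e4]  # made-up sample values
print(count_damaged(values, to_fp16))  # the smallest and largest fall outside fp16
print(count_damaged(values, to_bf16))  # all stay finite and nonzero in bf16
```

An fp16 path generally needs an extra step for the values that get damaged, such as rescaling or clamping, which the bf16 path can skip.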