| HN Mirror



	by redox99 1086 days ago
	Sometimes during training, networks that would converge in fp32 explode to Infs or NaNs in fp16 because of its limited range. bf16 generally fixes that. It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
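To put numbers on "limited range", here is a small sketch that computes the largest finite value of each format from its bit layout (fp16: 5 exponent bits, 10 mantissa bits; bf16: 8 and 7; fp32: 8 and 23). The helper name `max_finite` is made up for illustration and is not from any library.

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the largest
    # usable exponent code is 2**exp_bits - 2.
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp

FP16_MAX = max_finite(5, 10)   # 65504.0
BF16_MAX = max_finite(8, 7)    # roughly 3.39e38
FP32_MAX = max_finite(8, 23)   # roughly 3.40e38

print(FP16_MAX, BF16_MAX, FP32_MAX)
```

Any activation or gradient above 65504 becomes Inf in fp16, while bf16 covers nearly the same range as fp32 at the cost of fewer mantissa bits.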

1 comment

voz_ 1086 days ago

Yeah, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized entropy problems (explosion, not converging, converging slower) with fp16 than with bf16.

The transfer pipeline I wrote for fp32->fp16 also took a lot more work than for fp32->bf16.
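One plausible reason for the difference in effort (an assumption, not the pipeline described above): bf16 keeps fp32's 8-bit exponent, so a conversion can simply drop the low 16 bits of the fp32 encoding, whereas fp16 has a 5-bit exponent, so every value has to be checked against a much smaller range. A stdlib-only sketch, with hypothetical helper names `to_fp16` and `to_bf16`:

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        # The struct "e" format is IEEE 754 binary16.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; hardware
        # casts would typically produce inf here, so mimic that.
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the fp32 encoding.

    This truncates instead of rounding to nearest, which is a
    simplification; real converters usually round.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_fp16(70000.0))  # inf: above the fp16 maximum of 65504
print(to_bf16(70000.0))  # 69632.0: in range, but with reduced precision
```

The fp16 path needs the overflow branch (and, in a real pipeline, probably some rescaling or clamping policy), while the bf16 path is a single bit mask.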
