| HN Mirror



	by minimaxir 1086 days ago
	I'm still confused by the proliferation of bf16. Although it certainly doesn't hurt compared to fp16, in my testing even with A100 GPUs optimized for it, both training speed and inference quality are the same between bf16 and fp16.

6 comments

redox99 1086 days ago

Sometimes during training, fp16 will cause networks that would converge in fp32 to explode to Infs or NaNs because of its limited range. bf16 generally fixes that.

It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
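
A minimal sketch of the range issue, assuming NumPy only. NumPy has no native bf16, so the hypothetical helper `to_bf16` approximates it by zeroing the low 16 bits of a float32 (truncation, whereas real hardware usually rounds to nearest):

```python
import numpy as np

def to_bf16(x32):
    """Approximate bf16 by keeping only the top 16 bits of each float32.

    This truncates instead of rounding, so it is a simplification.
    """
    bits = x32.view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x32 = np.array([70000.0], dtype=np.float32)   # above the fp16 maximum of 65504

with np.errstate(over="ignore"):              # silence the overflow warning
    x16 = x32.astype(np.float16)              # overflows to inf

xbf = to_bf16(x32)                            # stays finite, but coarser

print(x16[0])   # inf
print(xbf[0])   # 69632.0
```

The value survives in the simulated bf16, but only to about two or three significant digits.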


voz_ 1086 days ago

Yeah, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized entropy problems (explosion, not converging, converging more slowly) with fp16 than with bf16.

The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.


dlewis1788 1086 days ago

My understanding is that for certain types of networks, BF16 will train better than FP16, given the additional protection against exploding gradients and loss functions that comes with the extended range of BF16, at the cost of precision.
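
To put rough numbers on that trade-off, here is a small sketch that derives the largest finite value and the machine epsilon from the bit layout of each format (fp16: 5 exponent bits, 10 mantissa bits; bf16: 8 exponent bits, 7 mantissa bits). The helper name `format_limits` is made up for illustration:

```python
import numpy as np

def format_limits(exp_bits, man_bits):
    """Largest finite value and machine epsilon of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    eps = 2.0 ** -man_bits            # gap between 1.0 and the next value
    return max_val, eps

fp16_max, fp16_eps = format_limits(exp_bits=5, man_bits=10)
bf16_max, bf16_eps = format_limits(exp_bits=8, man_bits=7)

print(f"fp16: max={fp16_max:.4g}, eps={fp16_eps:.3g}")
print(f"bf16: max={bf16_max:.4g}, eps={bf16_eps:.3g}")

# Sanity check of the formula against NumPy's own fp16 description.
assert fp16_max == float(np.finfo(np.float16).max)
assert fp16_eps == float(np.finfo(np.float16).eps)
```

bf16 reaches roughly the same range as fp32, while its epsilon is 8 times larger than that of fp16.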


YetAnotherNick 1086 days ago

It is generally easier to train a neural network in bf16 than in fp16, since no loss scaling is needed. And most model training and inference performs the same with fp32 and bf16.
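
Assuming "scaling" here means loss scaling as used in fp16 mixed-precision training, this sketch shows why fp16 needs it: a tiny gradient flushes to zero in fp16 unless it is scaled up first and unscaled afterwards in fp32. The fixed factor `loss_scale` is a hypothetical choice; real frameworks typically adjust it dynamically.

```python
import numpy as np

grad32 = np.array([1e-8], dtype=np.float32)   # a tiny "true" gradient
loss_scale = np.float32(1024.0)               # hypothetical fixed scale factor

# Casting directly: 1e-8 is below fp16's smallest subnormal (~6e-8).
naive16 = grad32.astype(np.float16)

# Scale first, store in fp16, then unscale in fp32.
scaled16 = (grad32 * loss_scale).astype(np.float16)
recovered32 = scaled16.astype(np.float32) / loss_scale

print(naive16[0])       # 0.0
print(recovered32[0])   # close to 1e-8
```

bf16 shares fp32's exponent range, so gradients this small do not underflow there in the first place.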


bravura 1086 days ago

Despite the other answers, I will tell you the grim truth: Your mileage might vary.

It's an empirical question and depends upon the nature of your problem and data. You should try all three (fp32, fp16, and bf16) as part of your model selection / hyperparameter tuning.

For example, in audio generative models (where typical output is 16-bit), I've sometimes found that fp16 and bf16 just don't produce output as good as fp32 weights do.


gok 1086 days ago

Fp16 makes it easy to accidentally overflow, especially around summation operations.
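
A quick illustration with NumPy, as a sketch: every input is well within fp16 range, yet the sum is not. Accumulating in fp32 (the `dtype` argument of `sum`) avoids the problem.

```python
import numpy as np

x = np.full(1000, 100.0, dtype=np.float16)   # each element is harmless

with np.errstate(over="ignore"):             # silence the overflow warning
    s16 = x.sum()                            # fp16 result: 100000 > 65504

s32 = x.sum(dtype=np.float32)                # accumulate in fp32 instead

print(s16)   # inf
print(s32)   # 100000.0
```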


bobbylarrybobby 1086 days ago

(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?


dlewis1788 1086 days ago

Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.
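
For reference, gradient clipping by global norm can be sketched as follows. This is a plain NumPy illustration, not any framework's implementation; `clip_by_global_norm` and its parameters are hypothetical names.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    # Compute the norm in fp32 so the squares cannot overflow fp16.
    total = float(np.sqrt(sum(np.sum(np.square(g.astype(np.float32))) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

grads = [np.array([300.0, 400.0], dtype=np.float32)]   # global norm is 500
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.linalg.norm(clipped[0]))

print(norm_before, norm_after)
```

Keeping gradient magnitudes bounded this way is one of the things that makes fp16's narrow range workable.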
