(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?
Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.