To me it looks like a clever optimization. It has the same range as FP32 at half the size, with less precision, and you can convert back and forth by truncating the low 16 bits or appending 16 zero bits.
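For anyone who wants to see the truncate/zero-pad trick concretely, here is a minimal Python sketch. The function names are mine, invented for illustration, and it does plain truncation as described above; as far as I know, real implementations often round to nearest instead, so treat this as a model of the idea rather than of any particular hardware.

```python
import struct

def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_bfloat16_bits(x: float) -> int:
    """Narrow by truncation: keep sign, 8 exponent bits, top 7 mantissa bits."""
    return float32_bits(x) >> 16

def from_bfloat16_bits(b: int) -> float:
    """Widen by appending 16 zero bits to get back a float32."""
    return bits_to_float32((b & 0xFFFF) << 16)

def roundtrip(x: float) -> float:
    """Value of x after passing through the 16-bit format and back."""
    return from_bfloat16_bits(to_bfloat16_bits(x))

if __name__ == "__main__":
    for value in (1.0, 3.14159265, 1e38):
        print(value, hex(to_bfloat16_bits(value)), roundtrip(value))
```

The exponent field is untouched, which is why something as large as 1e38 survives the round trip, while pi comes back with only two or three correct decimal digits.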
Google uses it on their TPUs [0]. If you're interested in how it would affect the numerical stability of an algorithm you want to use, there is a Julia package that makes prototyping linear algebra over this datatype pretty straightforward [1].
And Facebook is taking this even further. And while all these things are very cool, do not let ASIC designers claim they are barriers to entry for GPUs and CPUs. Whatever variants of this precision potpourri catch on are but a generation away from incarnation in general processors IMO...
I would be extremely surprised if the motivation for putting bfloat16 in TensorFlow was not the TPU. That first public commit was ~1.5 years before TPUv2 was announced at I/O, so it was almost certainly already in development.
bfloat16 was first in DistBelief, so it actually predates TensorFlow and TPUs (I worked on both systems). IIRC the motivation was more about minimizing parameter exchange bandwidth for large-scale CPU clusters rather than minimizing memory bandwidth within accelerators, but the idea generalized.
Why is it clever to change the mantissa and exponent size? I thought the clever one was Nervana's Flexpoint, which seemed at least partially novel. It's also interesting that Intel isn't pushing that format, given that Nervana's ASIC had it.
[0] https://cloud.google.com/tpu/docs/system-architecture
[1] https://github.com/JuliaComputing/BFloat16s.jl