Oh, I should have clarified - could one start with a bfloat16 on the software side, convert to float16 (so that e.g. a 3.4E38 bfloat16 becomes a 65504 float16), then do any "heavy math" in fast hardware float16 instructions, and then convert back at the end?
Nothing necessarily wrong with that code, but it also kinda smells. Why even store it as bfloat16 at all? You risk getting the numerical disadvantages of both 16-bit float representations (float16's narrow range during the math, bfloat16's coarse precision in storage), and none of the advantages.
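To make that concrete, here's a rough sketch using only Python's standard library. The helper names `to_bfloat16` and `to_float16_saturating` are made up for illustration, and bfloat16 is emulated by truncating the low 16 bits of a float32 (real hardware usually rounds instead), so treat it as a demo of the idea, not how any particular library does it.

```python
import struct

F16_MAX = 65504.0  # largest finite float16 value


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 storage: keep only the top 16 bits of the float32 encoding.

    Assumption: truncation instead of round-to-nearest, to keep the sketch short.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_float16_saturating(x: float) -> float:
    """Clamp to the finite float16 range, then round to float16 (struct format 'e').

    NaN is not handled here; this is only meant for finite inputs.
    """
    clamped = max(-F16_MAX, min(F16_MAX, x))
    return struct.unpack("<e", struct.pack("<e", clamped))[0]


# Range is limited by float16 while the math runs:
huge = to_bfloat16(3.0e38)
print(to_float16_saturating(huge))                # 65504.0 (saturated)
print(to_float16_saturating(to_bfloat16(1e-10)))  # 0.0 (underflow)

# Precision is limited by bfloat16 once the result is stored again:
print(to_float16_saturating(1.001))               # 1.0009765625
print(to_bfloat16(1.001))                         # 1.0
print(to_bfloat16(65504.0))                       # 65280.0
```

So anything outside roughly 6e-8 to 65504 gets squashed on the way in, and anything finer than about 2 to 3 significant decimal digits gets thrown away on the way out.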