by sillysaurusx 1086 days ago
At that point it’d be better to do everything in fp32. The hardware can’t do bf16 in the way you’re saying; the conversions would consume all your time.

Compute in F32, then round and pack a pair of BF16 values into 4 bytes.
The conversions are just a mask and a shift? Super cheap.
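
To make the "mask and shift" point concrete, here is a rough sketch in plain Python using only the standard `struct` module. The function names (`f32_to_bf16`, `pack_pair`, `unpack_pair`) are made up for illustration, and putting the first value in the low half of the word is an assumption, not something the thread specifies. Unpacking really is just a mask and a shift; packing with round-to-nearest-even costs one extra integer add on top of the shift. NaN handling is left out.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret a 32-bit pattern as a binary32 float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def f32_to_bf16(x: float) -> int:
    """Convert to the 16-bit bf16 pattern with round-to-nearest-even.

    bf16 is the top 16 bits of an f32, so this is an add plus a shift.
    NaN inputs are not handled specially in this sketch.
    """
    b = f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)  # rounding bias; ties go to the even result
    return (b >> 16) & 0xFFFF

def pack_pair(lo: float, hi: float) -> int:
    """Pack two values as bf16 into one 32-bit word (lo in the low half)."""
    return (f32_to_bf16(hi) << 16) | f32_to_bf16(lo)

def unpack_pair(word: int) -> tuple[float, float]:
    """Recover both values as f32: one shift for lo, one mask for hi."""
    lo = bits_f32((word & 0xFFFF) << 16)
    hi = bits_f32(word & 0xFFFF0000)
    return lo, hi

word = pack_pair(1.0, -2.0)
print(hex(word), unpack_pair(word))
```

If plain truncation is acceptable instead of rounding, the add goes away and the pack is literally `(hi_bits & 0xFFFF0000) | (lo_bits >> 16)`.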