by 37ef_ced3 1819 days ago

Yes, this AVX-512F instruction makes fp16-to-fp32 conversion efficient:

https://software.intel.com/sites/landingpage/IntrinsicsGuide...
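To make the storage side of this concrete, here is a minimal scalar sketch of what widening fp16 to fp32 means. It does not use the AVX-512 instruction itself; it relies on Python's `struct` module, whose `e` format is IEEE 754 binary16. The helper name `fp16_bytes_to_fp32` and the sample values are made up for illustration.

```python
import struct


def fp16_bytes_to_fp32(raw):
    """Decode little-endian IEEE 754 binary16 values into Python floats.

    Hypothetical helper for illustration only: a scalar stand-in for the
    vectorized conversion the comment refers to.
    """
    count = len(raw) // 2  # each fp16 value occupies 2 bytes
    return list(struct.unpack(f"<{count}e", raw))


# Sample values chosen to be exactly representable in fp16.
values = [0.5, 1.0, -2.5, 65504.0]

packed16 = struct.pack(f"<{len(values)}e", *values)  # fp16 storage
packed32 = struct.pack(f"<{len(values)}f", *values)  # fp32 storage

# fp16 needs half the bytes, so half the memory traffic per value.
print(len(packed16), len(packed32))

# Widening is lossless: every fp16 value has an exact fp32 representation.
print(fp16_bytes_to_fp32(packed16))
```

The point of the sketch is only that data kept in fp16 costs half the bytes to move, and that converting it up for the arithmetic loses nothing.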

The result is that Winograd convolutions can achieve an effective FMA rate of about twice the CPU's peak rate.

The Winograd transform reduces the required number of FMAs by a factor of 5, but you can only issue FMAs at half the peak rate (because you are bandwidth-limited), so you come out ahead by a factor of 2.5 in theory (about 2 in practice).
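The arithmetic can be sketched as follows. The comment does not say which Winograd tile size is used; the counts below assume the standard 2D algorithm F(m x m, r x r), which needs (m + r - 1)^2 elementwise multiplies per output tile, and they ignore the cost of the input, filter, and output transforms. Under that assumption, F(6 x 6, 3 x 3) is one tile size that gives a reduction of roughly 5. All function names are hypothetical.

```python
def direct_multiplies(m, r):
    """Multiplies to compute an m x m output tile with an r x r filter directly."""
    return (m * m) * (r * r)


def winograd_multiplies(m, r):
    """Elementwise multiplies for 2D Winograd F(m x m, r x r).

    One multiply per element of the (m + r - 1) x (m + r - 1) transformed
    tile; transform costs are deliberately not counted here.
    """
    return (m + r - 1) ** 2


def effective_speedup(reduction, fraction_of_peak):
    """Speedup over running the direct algorithm at full peak FMA rate."""
    return reduction * fraction_of_peak


for m in (2, 4, 6):
    reduction = direct_multiplies(m, 3) / winograd_multiplies(m, 3)
    print(f"F({m}x{m}, 3x3): {reduction:.4f}x fewer multiplies")

# The comment's numbers: a 5x reduction at half the peak FMA rate.
print(effective_speedup(5.0, 0.5))  # 2.5
```

With 324 direct multiplies against 64 for the 6 x 6 tile, the reduction is 5.0625, which is consistent with the "factor of 5" figure if that tile size is the one intended.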

Without fp16, that 2x advantage would be lost.