|
|
|
|
|
by 37ef_ced3
1772 days ago
|
|
Yes, this AVX-512F instruction makes fp16-to-fp32 conversion efficient: https://software.intel.com/sites/landingpage/IntrinsicsGuide... The result is that Winograd convolutions can achieve an effective FMA rate of twice the CPU's peak rate. The Winograd transform reduces the required number of FMAs by a factor of 5, but because you are bandwidth limited you can only issue FMAs at half the peak rate, so you come out ahead by a factor of 2.5 in theory (about 2 in practice). Without fp16, that 2x advantage would be lost. |
|
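To make the arithmetic in the comment above concrete, here is a rough back-of-the-envelope sketch in Python. The function and variable names are made up for illustration. The factor of 5 and the half-peak rate come from the comment; the fp32 line is only an assumption on my part (that a bandwidth-bound FMA rate scales inversely with bytes per element), not something the comment states.

```python
import struct

def effective_fma_speedup(fma_reduction: float, fraction_of_peak: float) -> float:
    """Effective speedup over direct convolution running at peak FMA rate.

    fma_reduction:    how many times fewer FMAs the transform needs
    fraction_of_peak: fraction of peak FMA rate actually sustained
    """
    return fma_reduction * fraction_of_peak

# Figures taken from the comment: 5x fewer FMAs, issued at half of peak rate.
fp16_speedup = effective_fma_speedup(5.0, 0.5)

# Storage size per element, from the standard library's format codes:
# 'e' is IEEE 754 half precision, 'f' is single precision.
fp16_bytes = struct.calcsize("<e")
fp32_bytes = struct.calcsize("<f")

# ASSUMPTION (not from the comment): when bandwidth bound, the sustained
# FMA rate scales inversely with bytes per element, so fp32 inputs would
# halve the achievable fraction of peak.
fp32_fraction = 0.5 * fp16_bytes / fp32_bytes
fp32_speedup = effective_fma_speedup(5.0, fp32_fraction)

print(f"fp16 inputs: {fp16_speedup}x in theory")
print(f"fp32 inputs: {fp32_speedup}x in theory (under the assumed model)")
```

Under that assumed model, the gap between the two printed numbers is the factor of 2 that fp16 storage buys you; real hardware will land somewhat below the theoretical figures, as the comment notes.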