by Marat_Dukhan
2778 days ago
The performance on the plot is higher than the FP32 peak, but that is not an error: FBGEMM does not compute in FP32, it computes in 8-bit fixed point. On a Broadwell CPU, each core can do 16 FP32 multiply-adds per cycle (2x 8-wide FMA instructions, VFMAxxxPS), but 32 8-bit multiply-adds per cycle (1x 32-wide multiplication with accumulation of adjacent results, via the VPMADDUBSW instruction).
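To make the arithmetic concrete, here is a rough scalar sketch in Python of what a 256-bit VPMADDUBSW does (unsigned bytes times signed bytes, adjacent products summed with signed 16-bit saturation), followed by the per-cycle counts from the comment above. The function name `maddubs` and the variable names are made up for illustration; this is a model of the instruction's semantics, not FBGEMM code.

```python
def maddubs(a_u8, b_s8):
    """Scalar model of VPMADDUBSW (hypothetical helper, not a real API).

    a_u8 holds unsigned 8-bit values, b_s8 holds signed 8-bit values.
    Each pair of adjacent products is summed and saturated to signed 16-bit.
    """
    assert len(a_u8) == len(b_s8) and len(a_u8) % 2 == 0
    out = []
    for i in range(0, len(a_u8), 2):
        s = a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1]
        out.append(max(-32768, min(32767, s)))  # signed 16-bit saturation
    return out


# Per-cycle multiply-add counts as stated in the comment (256-bit registers).
REGISTER_BITS = 256
fp32_madds_per_cycle = 2 * (REGISTER_BITS // 32)  # two 8-wide FMA instructions
int8_madds_per_cycle = 1 * (REGISTER_BITS // 8)   # one 32-wide VPMADDUBSW
ratio = int8_madds_per_cycle / fp32_madds_per_cycle

print(fp32_madds_per_cycle, int8_madds_per_cycle, ratio)
```

So under these assumptions the 8-bit path has twice the multiply-add throughput of the FP32 path, which is why a plot measured in operations per second can sit above the FP32 peak. Note the saturation step: the 16-bit intermediate sums can clip for large inputs.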