by ajtulloch
962 days ago
It's quite unfavorable on modern hardware. A Sapphire Rapids core can issue two 32-lane half-precision FMAs (vfmadd132ph, [1]) per clock, which is 2 × 32 × 2 = 128 FLOPs/cycle. An 8-bit LUT with accumulation cannot reach that kind of throughput; even the shuffle alone (vpshufb) is too slow.

[1]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
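
To spell out the arithmetic, here is a rough back-of-the-envelope sketch in Python. The FMA figures (two issues per clock, 32 half-precision lanes) come from the comment above. The shuffle issue rate of one 512-bit vpshufb per clock is an assumption made purely for illustration, not a measured number, and the function names are made up for this sketch.

```python
def fma_flops_per_cycle(fma_issues_per_cycle: int, lanes_per_fma: int) -> int:
    """Peak FLOPs per cycle; each fused multiply-add counts as 2 FLOPs."""
    return fma_issues_per_cycle * lanes_per_fma * 2


def lut_lookups_per_cycle(shuffle_issues_per_cycle: int, bytes_per_shuffle: int) -> int:
    """Peak byte lookups per cycle when a byte shuffle serves as the LUT."""
    return shuffle_issues_per_cycle * bytes_per_shuffle


# A 512-bit register holds 512 / 16 = 32 half-precision lanes.
FP16_LANES = 512 // 16
# A 512-bit byte shuffle produces 512 / 8 = 64 looked-up bytes.
SHUFFLE_BYTES = 512 // 8

# From the comment: two 32-lane half-precision FMAs per clock.
fp16_flops = fma_flops_per_cycle(fma_issues_per_cycle=2, lanes_per_fma=FP16_LANES)

# ASSUMPTION (illustrative only): one 512-bit shuffle issued per clock.
lut_lookups = lut_lookups_per_cycle(shuffle_issues_per_cycle=1,
                                    bytes_per_shuffle=SHUFFLE_BYTES)

print(f"FP16 FMA: {fp16_flops} FLOPs/cycle "
      f"({fp16_flops // 2} multiplies + {fp16_flops // 2} adds)")
print(f"LUT via shuffle: {lut_lookups} lookups/cycle, accumulation not included")
```

Under that assumed rate, the shuffle alone yields at most 64 lookups per cycle, which only matches the 64 multiplies the FMA path performs, and the FMA path also does its 64 accumulations in the same cycle while the LUT path still needs separate add instructions.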