by ajtulloch 962 days ago
It's quite unfavorable on modern hardware. A Sapphire Rapids core can do 2 separate 32-wide half-precision FMAs (vfmadd132ph, [1]) per clock, which is 2 x 32 x 2 = 128 FLOPs/cycle. It is not possible to achieve that kind of throughput with an 8-bit LUT and accumulation; even just a shuffle with vpshufb is too slow.

[1]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
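
For anyone checking the arithmetic, here is a minimal sketch of where the 128 figure comes from. The two-FMAs-per-clock number is taken from the comment as given; the 32 lanes follow from a 512-bit register holding 16-bit values, and each FMA lane counts as two FLOPs (one multiply, one add). The constant and function names are made up for illustration.

```python
# Peak FLOPs/cycle implied by the figures in the comment above.
FMA_ISSUES_PER_CLOCK = 2        # "2 separate" FMAs per clock (as stated in the comment)
LANES_PER_FMA = 512 // 16       # 512-bit register / 16-bit half-precision values = 32
FLOPS_PER_FMA_LANE = 2          # one multiply + one add


def peak_flops_per_cycle(issues=FMA_ISSUES_PER_CLOCK,
                         lanes=LANES_PER_FMA,
                         flops_per_lane=FLOPS_PER_FMA_LANE):
    """Return the theoretical peak FLOPs per cycle for one core."""
    return issues * lanes * flops_per_lane


print(peak_flops_per_cycle())  # 128
```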

3 comments

That's absolutely wild. If you really only needed vpshufb, the throughput would be the same in terms of values, because there are twice as many values per register and you get to retire half as many instructions. The problem is that it takes a bunch more instructions to combine the two inputs and apply a LUT of 256 values :(
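
To make "combine the two inputs and apply a LUT of 256 values" concrete, here is a scalar sketch. It assumes (this is not stated in the thread) that each input is a 4-bit code, so a pair forms one 8-bit index into a 256-entry product table; the function names are hypothetical. A single vpshufb only selects among 16 byte entries per 128-bit lane, which is roughly why the vectorised version needs the extra combining instructions.

```python
def build_product_lut():
    """Build a 256-entry table where lut[(a << 4) | b] == a * b for 4-bit a, b."""
    return [a * b for a in range(16) for b in range(16)]


def lut_dot(xs, ys, lut):
    """Dot product of two sequences of 4-bit codes using lookups instead of multiplies."""
    acc = 0
    for a, b in zip(xs, ys):
        index = (a << 4) | b  # combine the two 4-bit inputs into one 8-bit index
        acc += lut[index]     # the table lookup replaces the multiply
    return acc


LUT = build_product_lut()
xs = [3, 15, 0, 7]
ys = [2, 15, 9, 1]
print(lut_dot(xs, ys, LUT))  # 238
```
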
>> It's quite unfavorable on modern hardware.

Fair point. It might help if the system is DRAM-bandwidth-limited, so that reducing the data size helps even though individual operations take multiple instructions. But that is not the situation with today's hardware.
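
A rough way to see that trade-off is a roofline-style estimate: the cost per value is whichever is larger, the memory time or the instruction time. The sketch below uses that simplification, and every number in it (bytes per cycle, instructions per value) is a made-up placeholder rather than a measurement of any real chip.

```python
def cycles_per_value(bytes_per_value, instrs_per_value,
                     bytes_per_cycle, instrs_per_cycle):
    """Roofline-style estimate: bounded by the slower of memory and compute."""
    memory_cycles = bytes_per_value / bytes_per_cycle
    compute_cycles = instrs_per_value / instrs_per_cycle
    return max(memory_cycles, compute_cycles)


# Hypothetical per-value costs: fp16 FMA handles 32 values per instruction,
# the 8-bit LUT path is assumed to need 4 instructions per 64 values.
FP16 = dict(bytes_per_value=2, instrs_per_value=1 / 32)
LUT8 = dict(bytes_per_value=1, instrs_per_value=4 / 64)

# Hypothetical machines: one starved for bandwidth, one with plenty.
STARVED = dict(bytes_per_cycle=1.0, instrs_per_cycle=2)
AMPLE = dict(bytes_per_cycle=256.0, instrs_per_cycle=2)

for name, machine in (("starved", STARVED), ("ample", AMPLE)):
    fp16 = cycles_per_value(**FP16, **machine)
    lut8 = cycles_per_value(**LUT8, **machine)
    print(name, "fp16:", fp16, "lut8:", lut8)
```

With these placeholder numbers the LUT path wins only on the bandwidth-starved machine, which matches the point being made: smaller data helps when memory is the bottleneck and hurts when compute is.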

Hmmm, is there anything preventing dedicated AI chips from having this LUT built in and vectorising it too?