	by ajtulloch 962 days ago
	It's quite unfavorable on modern hardware. A Sapphire Rapids core can issue two 32-lane half-precision FMAs (vfmadd132ph, [1]) per clock, which is 128 FLOPs/cycle. It is not possible to achieve that kind of throughput with an 8-bit LUT and accumulation; even just a shuffle with vpshufb is too slow. [1]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
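
For anyone checking the 128 figure, here is a minimal sketch of the arithmetic. It assumes the usual convention that one FMA counts as two floating-point operations (a multiply and an add); the function and parameter names are hypothetical, chosen only for illustration.

```python
def peak_flops_per_cycle(fma_issues_per_clock, lanes_per_fma, flops_per_fma=2):
    """Peak FLOPs per cycle for one core.

    Assumption: each fused multiply-add counts as 2 FLOPs.
    """
    return fma_issues_per_clock * lanes_per_fma * flops_per_fma


# Figures from the comment: 2 FMA instructions per clock, 32 half-precision lanes each.
print(peak_flops_per_cycle(fma_issues_per_clock=2, lanes_per_fma=32))
```

This is a peak number only; it says nothing about whether memory can keep the FMA units fed.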

3 comments

anonymoushn 961 days ago

That's absolutely wild. If you really only needed vpshufb, the throughput would be the same in terms of values, because there are twice as many values per register and you get to retire half as many instructions. But it takes a bunch more instructions to combine the two inputs and apply a LUT of 256 values :(
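
To make the "bunch more instructions" point concrete, here is a rough Python model of one way a 256-entry byte LUT could be built from 16-entry shuffles. It is a sketch under assumptions, not a claim about how any real kernel does it: `pshufb` below only models the per-byte behaviour of vpshufb within one 16-byte lane (low 4 bits of the index select a table byte, bit 7 zeroes the output), and `lut256_via_pshufb` is a hypothetical helper that splits the table into 16 sub-tables and selects by high nibble.

```python
def pshufb(table16, indices):
    """Model vpshufb within one lane: pick table16[idx & 0x0F], or 0 if bit 7 of idx is set."""
    assert len(table16) == 16
    return [0 if i & 0x80 else table16[i & 0x0F] for i in indices]


def lut256_via_pshufb(table256, indices):
    """Apply a 256-entry byte LUT using only 16-entry shuffles (illustrative, not optimized).

    Returns the looked-up values and the number of shuffles issued.
    """
    assert len(table256) == 256
    low = [i & 0x0F for i in indices]   # low nibble indexes within a sub-table
    out = [0] * len(indices)
    shuffles = 0
    for hi in range(16):                # one shuffle per 16-entry sub-table
        sub_table = table256[hi * 16:(hi + 1) * 16]
        looked_up = pshufb(sub_table, low)
        shuffles += 1
        # Stand-in for a compare plus blend: keep results whose high nibble matches.
        for k, idx in enumerate(indices):
            if idx >> 4 == hi:
                out[k] = looked_up[k]
    return out, shuffles


# Hypothetical table, used only to exercise the sketch.
table = [(x * 7 + 3) % 256 for x in range(256)]
values, n_shuffles = lut256_via_pshufb(table, list(range(256)))
print(n_shuffles)
```

In this model each register of indices costs 16 shuffles plus the compares and blends, which is why a single vpshufb matching the FMA rate per value does not carry over to a full 8-bit LUT.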


phkahler 960 days ago

>> It's quite unfavorable on modern hardware.

Fair point. It might help if the system is DRAM bandwidth limited, so reducing the data size helps even though individual operations take multiple instructions. But that is not the situation with today's hardware.


mattsan 957 days ago

Hmmm, is there anything preventing dedicated AI chips from having this LUT built in and vectorising it too?
