Hacker News
by ffast-math 1462 days ago
Definitely. On CPUs, you could make this 2x faster pretty easily with just another execution port for vpshufb / vtbl and an instruction that unpacks the low and high 4-bit halves of each byte.
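
To make the "lo and hi unpack" step concrete, here is a minimal scalar sketch of what such an instruction would do in one go. The name `unpack_nibbles` and the choice of which half is "lo" are assumptions for illustration, not anything from Bolt or MADDNESS; a real kernel would do this across a whole SIMD register with a mask and a shift.

```python
def unpack_nibbles(packed):
    """Split each byte into its low and high 4-bit halves.

    Hypothetical helper for illustration: each byte is assumed to hold
    two 4-bit table indices, low half first.
    """
    lo = [b & 0x0F for b in packed]  # low 4 bits of every byte
    hi = [b >> 4 for b in packed]    # high 4 bits of every byte
    return lo, hi


lo, hi = unpack_nibbles(bytes([0xA3, 0x0F]))
print(lo, hi)
```

Each resulting index is in the range 0..15, which is exactly what a 16-entry byte shuffle such as vpshufb or vtbl can consume.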

Though the real speedup would come from allowing dense matmul ASICs to operate on 16-byte tables and 4-bit indices as operands. The reason Bolt and MADDNESS end up so fast is that they produce "sparse" representations that are still contiguous, strided arrays in memory. So the kernels and access patterns are just like those of dense GEMMs (and therefore vectorizable, etc.), but with lookup-adds instead of multiply-adds.
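
A rough sketch of what "lookup-adds instead of multiply-adds" looks like, in plain Python so the loop structure is visible. The function name `lookup_add_matmul` and the layouts `codes[n][c]` and `tables[c][m][k]` are assumptions made for this example, not the actual Bolt or MADDNESS data layout.

```python
def lookup_add_matmul(codes, tables):
    """Approximate-matmul inner loops using table lookups.

    Assumed layout (illustrative only):
      codes[n][c]     -> 4-bit index (0..15) for row n, codebook c
      tables[c][m][k] -> precomputed partial sum for codebook c,
                         output column m, index k (16 entries)
    """
    n_rows = len(codes)
    n_codebooks = len(tables)
    n_cols = len(tables[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for n in range(n_rows):
        for m in range(n_cols):
            acc = 0
            for c in range(n_codebooks):
                # A dense GEMM would do acc += a[n][c] * b[c][m] here;
                # the lookup version indexes a 16-entry table instead.
                acc += tables[c][m][codes[n][c]]
            out[n][m] = acc
    return out


# Two codebooks, one output column, 16 entries per table.
tables = [
    [list(range(16))],               # codebook 0: entry k is k
    [[10 * k for k in range(16)]],   # codebook 1: entry k is 10*k
]
codes = [[0, 15], [3, 1]]
print(lookup_add_matmul(codes, tables))  # [[150], [13]]
```

The three nested loops walk the same contiguous, strided arrays as a dense GEMM would, which is why the same blocking and vectorization tricks carry over.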

Hopefully-clarifying image: https://imgur.com/a/trOB69U

1 comment

Fascinating, might try implementing this on an FPGA.