|
|
|
|
|
by ffast-math
1462 days ago
|
|
Definitely. On CPUs, you could make this 2x faster pretty easily with just another execution port for vpshufb / vtbl and a 4bit lo and hi unpack instruction. Though the real speedup would be allowing dense matmul ASICs to operate on 16-byte tables and 4-bit indices as operands. The reason Bolt and MADDNESS end up so fast is that they produce "sparse" representations that are still contiguous, strided arrays in memory. So the kernels and access patterns are just like those of dense GEMMs (and therefore vectorize-able, etc), but with lookup-adds instead of multiply-adds. Hopefully-clarifying image: https://imgur.com/a/trOB69U |
|