by robertknight 494 days ago
One interesting thing I discovered comparing various matrix multiplication implementations used in ML libraries is that several of them (ONNX Runtime, XNNPack, any others?) skip the step, from BLIS's textbook algorithm, of packing the LHS matrix. Instead they pack only the RHS. Since the RHS holds the weights, it can be packed once ahead of time, and an inference pass then does not need to do any packing at all.

From skimming various papers, it seems the original motivation for packing the LHS was to reduce TLB misses, even though only a single element is broadcast from it at a time. (NB: this is the opposite of the order in this post, where the row count in the microkernel, rather than the column count, is a multiple of the register size.) Apparently TLB misses are not a problem in practice on modern CPUs and for the problem sizes common in ML inference.
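
To make the RHS-only scheme concrete, here is a rough pure-Python sketch of the idea, not how ONNX Runtime or XNNPack actually implement it. The names `pack_rhs`, `matmul_packed_rhs` and the panel width `NR` are made up for illustration; `NR` stands in for the SIMD register width, and the inner loop over `j` stands in for one vector multiply-add with a broadcast LHS element.

```python
NR = 4  # hypothetical panel width, standing in for the SIMD register width


def pack_rhs(b, k, n, nr=NR):
    """Pack a row-major K x N matrix (flat list) into column panels of width nr.

    Each panel stores its K rows back to back, so the kernel reads it with
    unit stride. The last panel is zero-padded if n is not a multiple of nr.
    """
    panels = []
    for j0 in range(0, n, nr):
        panel = [0.0] * (k * nr)
        for p in range(k):
            for j in range(j0, min(j0 + nr, n)):
                panel[p * nr + (j - j0)] = b[p * n + j]
        panels.append(panel)
    return panels


def matmul_packed_rhs(a, packed_b, m, k, n, nr=NR):
    """Compute C = A @ B, reading A in place (unpacked) and B from panels."""
    c = [0.0] * (m * n)
    for panel_idx, panel in enumerate(packed_b):
        j0 = panel_idx * nr
        width = min(nr, n - j0)
        for i in range(m):
            acc = [0.0] * nr  # one "register" of accumulators
            for p in range(k):
                a_ip = a[i * k + p]  # single LHS element, broadcast
                row = panel[p * nr:(p + 1) * nr]
                for j in range(nr):  # stands in for one vector FMA
                    acc[j] += a_ip * row[j]
            # Store only the valid columns, dropping any zero padding.
            c[i * n + j0:i * n + j0 + width] = acc[:width]
    return c
```

In an inference setting you would call `pack_rhs` once when the weights are loaded and reuse the result for every call to `matmul_packed_rhs`, so the per-inference path touches the LHS only in its original layout.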