by ribit
851 days ago
It is also interesting how this relates to hardware implementations. Nvidia and Intel AMX appear to use dot product engines under the hood and perform a matrix multiplication in a single instruction. Apple AMX and ARM SME use outer product engines and require multiple instructions for a single matrix multiplication.
Moreover, dot product throughput is limited by memory read throughput: a dot product of two length-n vectors read from memory needs 2n loads for only n multiplications.
Any matrix-matrix product is best implemented with tensor (outer) products of vectors, because each such product consists of independent operations, so their latencies can be hidden. In addition, a tensor product requires a number of multiplications equal to the product of the operand sizes, but a number of loads from memory equal only to the sum of the operand sizes.
With enough registers to store the matrix result, it is easy to ensure that the product of the operand sizes is greater than the sum of the operand sizes, so that the throughput of the memory reads does not limit the attainable performance.
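The counting argument above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: matrices are plain nested lists, the accumulator `C` stands in for a register tile, and the name `matmul_outer` and its two counters are made up for this example, not taken from any library or hardware manual.

```python
def matmul_outer(A, B):
    """Multiply A (m x k) by B (k x n) as a sum of k outer products.

    Returns (C, loads, multiplies), where the counters are hypothetical
    bookkeeping that models operand loads and multiplications.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]  # result tile, assumed to live in registers
    loads = 0
    multiplies = 0
    for p in range(k):
        col = [A[i][p] for i in range(m)]  # m loads: one column of A
        row = B[p]                         # n loads: one row of B
        loads += m + n
        # Outer product of col and row: m * n multiply-adds, and no
        # element of C depends on any other element of C.
        for i in range(m):
            for j in range(n):
                C[i][j] += col[i] * row[j]
                multiplies += 1
    return C, loads, multiplies


if __name__ == "__main__":
    # One outer-product step on a 4 x 4 tile: 8 loads feed 16 multiplications.
    A = [[1.0], [2.0], [3.0], [4.0]]
    B = [[1.0, 2.0, 3.0, 4.0]]
    _, loads, multiplies = matmul_outer(A, B)
    print(loads, multiplies)
```

For an m x n tile, each step costs m + n loads and m * n multiplications, so the ratio of arithmetic to memory traffic improves as the tile held in registers grows.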
Where a matrix-matrix product is done by a single instruction, it normally also uses tensor products. Only when both the input operands and the result are stored in registers can the instruction also be implemented with AXPY operations (where the fused multiply-add operations are also independent), but not with dot products (whose dependent FMAs prevent pipelining).
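The difference in dependencies between the two formulations can be illustrated with a matrix-vector product written both ways. This is a rough Python sketch, not a statement about how any particular hardware does it; the function names `axpy`, `matvec_dot` and `matvec_axpy` are chosen for this example only.

```python
def axpy(a, x, y):
    """Return a * x + y elementwise (scalar a, vectors x and y).

    Each output element is its own multiply-add, independent of the others.
    """
    return [a * xi + yi for xi, yi in zip(x, y)]


def matvec_dot(A, x):
    """Row-wise A @ x: every output element is one dot product."""
    y = []
    for row in A:
        acc = 0.0
        for a_ij, x_j in zip(row, x):
            # Each multiply-add needs the previous value of acc,
            # so the operations form a serial dependency chain.
            acc = a_ij * x_j + acc
        y.append(acc)
    return y


def matvec_axpy(A, x):
    """Column-wise A @ x: accumulate x[j] times column j of A."""
    m = len(A)
    y = [0.0] * m
    for j, x_j in enumerate(x):
        col = [A[i][j] for i in range(m)]
        # The m multiply-adds inside one axpy call touch different
        # elements of y, so they could be interleaved and pipelined.
        y = axpy(x_j, col, y)
    return y


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    x = [5.0, 6.0]
    print(matvec_dot(A, x), matvec_axpy(A, x))
```

Both functions compute the same result; only the order of the multiply-adds differs, and with it the length of the dependency chains.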
"AXPY" is a name that comes from BLAS and refers to a fundamental linear algebra operation, "scalar A times vector X Plus vector Y". In many cases it is possible to choose between AXPY and scalar (dot) products. AXPY is normally the right choice, because it is composed of independent FMAs, which can be interleaved and pipelined.