by adrian_b 851 days ago
Nobody is using dot product engines, because dot product throughput is limited by the latency of fused multiply-add, instead of by the clock frequency.

Moreover, dot product throughput is limited by the memory read throughput.

A matrix-matrix product is best implemented with tensor products (outer products) of vectors, because each such product consists of independent operations, so their latencies can be hidden. Moreover, a tensor product requires a number of multiplications equal to the product of the sizes of the operands, but a number of loads from memory equal only to the sum of the sizes of the operands.

With enough registers to store the matrix result, it is easy to ensure that the product of the operand sizes is greater than the sum of the operand sizes, so that the throughput of the memory reads does not limit the attainable performance.
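
To make the counting concrete, here is a minimal Python sketch of a matrix product built from outer products. It is only an illustration, not how any particular library or hardware does it. The name `matmul_outer` and the `loads`/`mults` counters are hypothetical, and the accumulator `C` merely stands in for a result tile held in registers.

```python
def matmul_outer(A, B):
    """Return (C, loads, mults) where C = A @ B, built as a sum of outer products.

    A is m x K and B is K x n, both given as lists of rows.
    """
    m, K, n = len(A), len(B), len(B[0])
    # Accumulator tile; in a real kernel this would stay in registers.
    C = [[0.0] * n for _ in range(m)]
    loads = 0
    mults = 0
    for k in range(K):
        col = [A[i][k] for i in range(m)]  # m loads: column k of A
        row = B[k]                         # n loads: row k of B
        loads += m + n
        # Outer product of col and row: m*n multiply-adds, each one
        # updating a different C[i][j], so none depends on another.
        for i in range(m):
            for j in range(n):
                C[i][j] += col[i] * row[j]
                mults += 1
    return C, loads, mults
```

With a 4x4 accumulator tile, for example, each step performs 4*4 = 16 multiply-adds for 4+4 = 8 loaded values, and the ratio improves as the tile grows.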

Where a matrix-matrix product is done by a single instruction, it normally also uses tensor products. Only when both the input operands and the result are stored in registers could the instruction also be implemented with AXPY operations (where the fused multiply-add operations are also independent), but not with dot products (whose dependent FMAs prevent pipelining).

"AXPY" is a name that comes from the BLAS library and it refers to an operation fundamental in linear algebra, "A times vector X Plus vector Y". There are many cases when it is possible to choose between AXPY and scalar products. AXPY is normally the right choice, because it is composed of independent FMAs, which can be interleaved and pipelined.

1 comment

Thank you, very insightful and makes perfect sense! I do wonder however why Nvidia and Intel chose not to expose an AXPY/outer product instruction if they use these kinds of operations under the hood. I can imagine them being useful in their own right. My best guess is that this gives them freedom to change the implementation details later on (e.g. the order of swizzles)?