by adrian_b 206 days ago
False.

While there are indeed parts of the workloads that must be executed in vector units, those parts are limited by the memory interface throughput, not by the computational throughput.

Only the matrix-matrix operations are limited by computational throughput rather than by memory throughput. All matrix-matrix operations (this includes the solving of dense systems of equations, which is the most frequent kind of non-AI workload) are better done with dedicated matrix units, because matrix units reduce the number of memory transfers required to perform matrix operations.
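
To put rough numbers on the memory-bound versus compute-bound distinction, here is a back-of-the-envelope sketch of arithmetic intensity (flops per byte moved). The function names are made up for illustration, and the traffic counts assume 8-byte elements and ideal reuse (each operand read from memory once, each result written once), which real hardware only approaches.

```python
def gemv_intensity(n, bytes_per_elem=8):
    """Flops per byte for y = A @ x with an n x n matrix A (idealized)."""
    flops = 2 * n * n                            # one multiply and one add per matrix element
    traffic = (n * n + 2 * n) * bytes_per_elem   # read A and x, write y
    return flops / traffic


def gemm_intensity(n, bytes_per_elem=8):
    """Flops per byte for C = A @ B with n x n matrices (idealized)."""
    flops = 2 * n ** 3                           # n multiply-adds per output element
    traffic = 3 * n * n * bytes_per_elem         # read A and B, write C, once each
    return flops / traffic


for n in (64, 1024, 16384):
    print(n, gemv_intensity(n), gemm_intensity(n))
```

Under these assumptions the matrix-vector intensity stays below 0.25 flops per byte no matter how large n gets, while the matrix-matrix intensity grows as n/12, which is why only the latter can keep the compute units busy.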


> this includes the solving of dense systems of equations

Is there even dedicated hardware for LU?

There is no need for dedicated hardware for LU, because for big matrices LU can be reduced to matrix-matrix multiplications of smaller submatrices.
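
A minimal sketch of that reduction, in plain Python so the structure is visible. It assumes no pivoting is needed (for example a diagonally dominant matrix), and the function name and block size `bs` are illustrative, not taken from any library. For a large matrix with a small block size, step 3 (the matrix-matrix multiply) accounts for most of the flops.

```python
def blocked_lu(A, bs=2):
    """In-place blocked LU without pivoting (assumes no zero pivots occur).

    On return, A holds U on and above the diagonal and the multipliers of
    the unit lower triangular L below the diagonal.
    """
    n = len(A)
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # 1. Unblocked factorization of the tall panel A[k:n, k:e].
        for j in range(k, e):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, e):
                    A[i][c] -= A[i][j] * A[j][c]
        # 2. Small triangular solve: U12 = L11^{-1} A12 (L11 is unit lower).
        for j in range(k, e):
            for i in range(j + 1, e):
                for c in range(e, n):
                    A[i][c] -= A[i][j] * A[j][c]
        # 3. Trailing update A22 -= L21 @ U12: an ordinary matrix-matrix multiply.
        for i in range(e, n):
            for c in range(e, n):
                A[i][c] -= sum(A[i][j] * A[j][c] for j in range(k, e))
    return A


# Example input: a small diagonally dominant matrix, so skipping pivoting is safe.
A = [[10.0, 2.0, 3.0, 1.0, 0.5],
     [2.0, 12.0, 1.0, 4.0, 1.0],
     [3.0, 1.0, 9.0, 2.0, 2.0],
     [1.0, 4.0, 2.0, 11.0, 3.0],
     [0.5, 1.0, 2.0, 3.0, 8.0]]
F = blocked_lu([row[:] for row in A], bs=2)
```

A production routine would add partial pivoting and hand step 3 to an optimized GEMM; the point here is only that the bulk of the work has the shape of a matrix-matrix product.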

LU for small matrices and most other operations with small matrices are normally better done in the vector units.

There is a mild lack of context here. If you have a single vector and want to solve LUx=b, you actually have matrix-vector multiplication. It's only in the batched case LUX=B, where X and B are matrices, that you need matrix-matrix multiplication.

For those who don't know: one of the most useful properties of triangular matrices is that the diagonal blocks of a triangular matrix are themselves triangular. This means you can solve for a subset of x using the first triangular block. Since that sub-vector of x is now known, you can do a forward multiplication of the non-triangular blocks against it and subtract the result from the b vector. This is the same as if you had removed the corresponding rows and columns from the triangular matrix. The remaining matrix stays triangular, which means you can just keep repeating this until the entire system is solved.
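
The procedure described above can be sketched in a few lines of plain Python for a lower triangular system Lx=b. The function name and the block size `bs` are illustrative choices, not anything from a particular library.

```python
def blocked_forward_substitution(L, b, bs=2):
    """Solve L x = b for lower triangular L, one diagonal block at a time."""
    n = len(L)
    x = list(b)  # working copy of b; overwritten by the solution block by block
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # Solve the small triangular diagonal block L[k:e, k:e] for x[k:e].
        for i in range(k, e):
            for j in range(k, i):
                x[i] -= L[i][j] * x[j]
            x[i] /= L[i][i]
        # Multiply the non-triangular block below by the now-known x[k:e]
        # and subtract it from the remaining right-hand side.
        for i in range(e, n):
            x[i] -= sum(L[i][j] * x[j] for j in range(k, e))
    return x


L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 7.0, 32.0]
x = blocked_forward_substitution(L, b, bs=2)
```

With a single right-hand side the subtraction step is a matrix-vector product; with a batch of right-hand sides (B a matrix) the same step becomes a matrix-matrix product, which is the case where matrix units pay off.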