|
|
|
|
|
by ascar
1106 days ago
|
|
Cache locality is certainly an important aspect and especially when it comes to matrix multiplication even just changing the loop order (and thus access pattern) has a huge performance impact. However, an O(n^3) algorithm will always loose out to an O(n^2*log(n)) algorithm, when the input gets big enough. And the difference between these two might be as simple as sorting your input data first. I teach parallel programming to graduates and the first exercise we give them is a sequential optimization for exactly that reason. Think about if your algorithm is efficient before thinking about all the challenges that come with parallelization. |
|
You have to be very careful about what 'big enough' means. In practice, Strassen multiplication is not faster than the naive algorithm until you get to the point where you're multiplying matrices with hundreds of rows/columns. Additionally, naive matrix multiplication is well suited to GPUs, while Strassen multiplication on the GPU requires temporary buffers and multiple jobs and sequencing and whatnot.
As a general rule, matrix multiplication with complexity better than the naive algorithm should probably not be used. Do naive matrix multiplication on the CPU. If you need it to be faster, do naive matrix multiplication on the GPU. If you need it to be faster, the numerical stability of your problem has probably already come a gutser and will get worse if you switch to Strassen or any of the other asymptotically faster algorithms.
And the algorithms faster than Strassen? Forget about it. After Strassen multiplication was invented, about a dozen or so other algorithms came along, slowly reducing that O(n^2.8) to about O(n^2.37188) or so. (most recently in 2022; this is still an area of active research) The problem is that for any of these algorithms to be faster than Strassen, you need matrices that are larger than what you can keep in memory. There is no big enough input that will fit in the RAM of a modern computer. One estimate I've heard is that if you convert every atom in the observable universe into one bit of RAM, and you use that RAM to multiply two 10^38 by 10^38 matrices to get a third 10^38 by 10^38 matrix, you're still better off using the O(n^2.8) Strassen multiplication instead of the state of the art O(n^2.37188) algorithm. The constant slowdown in the other algorithms really are that bad.