Hacker News new | ask | show | jobs
by keerthiko 2598 days ago
You can look at the basic algorithm on wikipedia[0], and you'll see that a lot of it is parallelizable (non-sequential dependency of calculations) which by default is easily (time-) optimized via GPU pipes.

After that, there are some caching optimizations to be had by iterating over the nested loops to minimize cache misses. Another optimization is to split up the matrices if their shapes are appropriate and use compounding matrix operations to recombine them, allowing for further use of parallelization in phases.

There are a few fancier algorithms out there which are optimized for certain assumptions -- extremely large matrices, several consecutive operations, matrices with many 0s or many identical entries or following certain patterns for values such as the identity matrix or eigenvectors, etc.

[0] https://en.wikipedia.org/wiki/Matrix_multiplication