|
It's not rocket science by any means, and the 'well behaved' case for matrix multiplication (sufficiently 'squarish' dimensions, i.e. not too similar to a dot product, running on a known system configuration) has been beaten to death, revived, and beaten to death again many, many times and current libraries can get ~90% of theoretical peak. That being said, there are a lot of nontrivial cases and pitfalls lying around that can cause some grief to someone trying to reinvent the wheel, as well as open theoretical problems that could lead to significant speedups if they're resolved. Here's a short and definitely not complete list of these: - Communication costs (transferring data between fast memory and slow memory, e.g. cache <-> memory or GPU memory <-> main memory) is the bottleneck in most cases. The asymptotic communication costs (both in terms of lower bounds, and in terms of attaining said lower bounds through a judicious choice of tiling sizes) of many of these problems are well-understood (see [Irony-Toledo '04] for the classic work on this, extended to most of linear algebra by [Ballard et al. '14]), but these asymptotic analyses have lots of hidden constants and parameters that need to be optimized for not just a given architecture but also a given problem size. For instance, suppose you have an inner-product or an input of sufficiently similar dimension (i.e. multiplying matrix of dimension mn by matrix of dimension nk, n >> m, k). If you're going to split this thing up into parallel blocks (e.g. through divide-and-conquer along the 'middle dimension' n, since we assume the other dimensions are small enough that you can't split them up too much) you'll have to allocate nksizeof(double, or whatever your matrix elements are) additional memory for each thread, since each thread needs a place to store its sub-result before they're all summed together. The theory will tell you that replicating until you fill up your memory is "optimal" (i.e. within a constant factor of the theoretical communication lower bound), but that constant can be fairly large (integer-factor in many cases), and optimizing that mostly is a matter of experimentation and intuition. In fact, back in 2013 the latest release of both Intel MKL and PLASMA did not deal with this case particularly well, although I haven't tried more recent versions. - Floating point addition is not associative and if you're using parallelism, there's no guarantee that two subsequent runs of the same program will give you the same result since the additions can happen in any order. For some cases, most notably debugging (but also times when you need code that meets certain verification or contractual requirements - you can't have code giving different results at different iterations) you want 'reproducibility', which is very nontrivial to implement (see: ReproBLAS project). - Precision issues: Do you need double-precision numbers? or will a single-precision or something smaller work? Often problems are more sensitive to certain results than others, so you'd want to switch between different precisions at different stages in your problem for additional optimization (see: Rubio Gonzalez et al.'s paper on Precimonious). - Matrix multiplication (and therefore all the other algorithms that depend on it, e.g. GEPP, TRSM, Cholesky...) has significantly faster algorithms than the classical O(n^3) naive one. Strassen's method (runtime ~O(n^2.8)) is sometimes used in practice, but there are algorithms with significantly better exponents (I believe the world record is currently held by Le Gall at somewhere around 2.37). Unfortunately, these algorithms tend to have massive constant factors render them impractical for real-world use, and as far as I know figuring out if these can actually be applied these in practice is an open problem. Reducing the exponent (or finding lower bounds for it) is also an area of active research of importance to not just the numerical linear algebra community but also to the complexity theory crowd (since a lot of problems can be reduced to matrix multiplication). - Advances in taking advantage of matrix structure to speed up linear solvers, most notably with Laplacian/strictly diagonally dominant solvers started by Spielman and Teng in '04 and built upon in recent years. As with the faster matrix multiplication algorithms many of these solvers are not yet competitive compared to stuff used in practice (multigrid methods and the like), which don't have the same kinds of guarantees but run quickly in practice. (I don't believe that there's been a whole lot of work in terms of actually implementing and optimizing of Laplacian solvers either). In the theory world it's also interesting to ask how far these efficient methods will go (how large is the class of systems that you can solve quickly?) - Kyng and Zhang have a paper this year, for instance, stating that fast (nearly-linear in the number of nonzeros of the coefficient matrix) solvers for a slight generalization of Laplacian systems (multicommodity Laplacians) would lead to similarly fast linear solvers for all linear equations, so either there are limits on how much you can take advantage of structure or we're going to see some super-fast algorithms for general linear systems. |