Hacker News new | ask | show | jobs
by radarsat1 3187 days ago
> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

Honestly, it might be tricky, but implementing matrix operations is not rocket science either. I find it incredible that so many projects rely on NVidia's proprietary libraries for doing this on GPU. Maybe there is some secret juice that I just don't know about, but it seems to me there can't be that many optimisation shortcuts for matrix multiplications and the like that require intimate, secret knowledge of the hardware.

4 comments

> Honestly, it might be tricky, but implementing matrix operations is not rocket science either.

Actually, it's very hard to implement efficient algebraic matrix-matrix and matrix-vector operations, although naive implementations are very easy to pull off. You're fooling yourself if you believe you can whip out an implementation for basic BLAS-(1|2|3) kernels that matches the performance of properly tuned implementations. Implementing a kernel whose performance doesn't stray too far from the hardware's capacity takes a lot of knowledge and work on low-level details such as cache hierarchies and its impact on the memory access performance. Floating point operations actually take a back-seat to memory access, as they represent a small fraction of the operations being performed by the kernel (IIRC, in sparse matrix operations the proportion of fp operations is only about 1-in-7) and the bulk of the implementation is focused on memory access that minimizes cache misses. Therefore, to even be in a position to implement an acceptable matrix-vector or matrix-matrix kernel you need to have a solid understanding on how particular performances handle memory. This isn't trivial, and it's one of the reasons why articles about implementing X on a GPU, even if X is a classic algorithm, are accepted and published by specialized publications.

The hard part isn’t getting it basically right. The hard part is knowing where subtle bugs are.

I remember my numerical computing textbook had one major takeaway: almost always write your own matrix algorithms; almost never use them.

An extraordinary amount of work has gone into linear algebra computation and much of it for mission critical stuff. It might not be “rocket science” but it’s close.

This is an important point, but I don't think it's insurmountable for a broader group of people.

Though technically not required, I think it helps to know a good deal of mathematical theory as well in order to check properties that should hold, which helps in tracking down subtle bugs. In my experience, tracking these bugs is a royal pain in the ass and writing good tests for these properties is tedious. Even then, there's a lot of funny stuff that goes on. One of my favorites is that it's possible to take the Choleski factorization of an indefinite matrix in some situations. That means it's not safe to use a Choleski to check for whether or a not a matrix is positive definite for mission critical functions:

https://scicomp.stackexchange.com/a/26223

But, anyway, mostly I contend that writing these algorithms is extremely tedious, so there's not that many people with the patience. I also still maintain that there's room for contribution here. Although most often these routines are written in C and Fortran, the parallel tooling for these languages is less sophisticated than that of modern languages. I've been wondering whether a language like Rust would make it easier to code and parallelize some of these algorithms.

> implementing matrix operations is not rocket science either.

Naively implementing them sure, but looking at OpenBLAS/LAPACK/etc code it seems pretty close to rocket science

It's not rocket science by any means, and the 'well behaved' case for matrix multiplication (sufficiently 'squarish' dimensions, i.e. not too similar to a dot product, running on a known system configuration) has been beaten to death, revived, and beaten to death again many, many times and current libraries can get ~90% of theoretical peak.

That being said, there are a lot of nontrivial cases and pitfalls lying around that can cause some grief to someone trying to reinvent the wheel, as well as open theoretical problems that could lead to significant speedups if they're resolved. Here's a short and definitely not complete list of these:

- Communication costs (transferring data between fast memory and slow memory, e.g. cache <-> memory or GPU memory <-> main memory) is the bottleneck in most cases. The asymptotic communication costs (both in terms of lower bounds, and in terms of attaining said lower bounds through a judicious choice of tiling sizes) of many of these problems are well-understood (see [Irony-Toledo '04] for the classic work on this, extended to most of linear algebra by [Ballard et al. '14]), but these asymptotic analyses have lots of hidden constants and parameters that need to be optimized for not just a given architecture but also a given problem size. For instance, suppose you have an inner-product or an input of sufficiently similar dimension (i.e. multiplying matrix of dimension mn by matrix of dimension nk, n >> m, k). If you're going to split this thing up into parallel blocks (e.g. through divide-and-conquer along the 'middle dimension' n, since we assume the other dimensions are small enough that you can't split them up too much) you'll have to allocate nksizeof(double, or whatever your matrix elements are) additional memory for each thread, since each thread needs a place to store its sub-result before they're all summed together. The theory will tell you that replicating until you fill up your memory is "optimal" (i.e. within a constant factor of the theoretical communication lower bound), but that constant can be fairly large (integer-factor in many cases), and optimizing that mostly is a matter of experimentation and intuition. In fact, back in 2013 the latest release of both Intel MKL and PLASMA did not deal with this case particularly well, although I haven't tried more recent versions.

- Floating point addition is not associative and if you're using parallelism, there's no guarantee that two subsequent runs of the same program will give you the same result since the additions can happen in any order. For some cases, most notably debugging (but also times when you need code that meets certain verification or contractual requirements - you can't have code giving different results at different iterations) you want 'reproducibility', which is very nontrivial to implement (see: ReproBLAS project).

- Precision issues: Do you need double-precision numbers? or will a single-precision or something smaller work? Often problems are more sensitive to certain results than others, so you'd want to switch between different precisions at different stages in your problem for additional optimization (see: Rubio Gonzalez et al.'s paper on Precimonious).

- Matrix multiplication (and therefore all the other algorithms that depend on it, e.g. GEPP, TRSM, Cholesky...) has significantly faster algorithms than the classical O(n^3) naive one. Strassen's method (runtime ~O(n^2.8)) is sometimes used in practice, but there are algorithms with significantly better exponents (I believe the world record is currently held by Le Gall at somewhere around 2.37). Unfortunately, these algorithms tend to have massive constant factors render them impractical for real-world use, and as far as I know figuring out if these can actually be applied these in practice is an open problem. Reducing the exponent (or finding lower bounds for it) is also an area of active research of importance to not just the numerical linear algebra community but also to the complexity theory crowd (since a lot of problems can be reduced to matrix multiplication).

- Advances in taking advantage of matrix structure to speed up linear solvers, most notably with Laplacian/strictly diagonally dominant solvers started by Spielman and Teng in '04 and built upon in recent years. As with the faster matrix multiplication algorithms many of these solvers are not yet competitive compared to stuff used in practice (multigrid methods and the like), which don't have the same kinds of guarantees but run quickly in practice. (I don't believe that there's been a whole lot of work in terms of actually implementing and optimizing of Laplacian solvers either). In the theory world it's also interesting to ask how far these efficient methods will go (how large is the class of systems that you can solve quickly?) - Kyng and Zhang have a paper this year, for instance, stating that fast (nearly-linear in the number of nonzeros of the coefficient matrix) solvers for a slight generalization of Laplacian systems (multicommodity Laplacians) would lead to similarly fast linear solvers for all linear equations, so either there are limits on how much you can take advantage of structure or we're going to see some super-fast algorithms for general linear systems.