Hacker News new | ask | show | jobs
LU Factorization and Linear Systems for Programers (dragan.rocks)
131 points by disaster01 3188 days ago
7 comments

By the way, if anyone is interested in good open source opportunities, computational linear algebra is nowhere near a solved problem and there's good opportunity for impactful contribution. The computational challenges of the algebra versus factorizations is one angle. Dense versus sparse is another. Shared memory parallelization vs distributed memory vs GPUs is another. Even on the GPU, there are different strategies depending on whether or not the entire matrix fits on a single GPU or if we have to use multiple GPUs. Incomplete or multilevel direct methods used as effective preconditioners for iterative methods are also important. Hell, even efficient direct techniques embedded in indirect solvers is important.

Part of the way to get started would be to look at something like a general numerical linear algebra book like Numerical Linear Algebra from Trefethen and Bau. There are better computational algorithms than what they present, but they do a good job at introducing important factorizations and why we care about them. Then, have a look at Tim Davis' book Direct Methods for Sparse Linear Systems. The codes in that book are online. Then, try to reimplement these algorithms in other languages, parallelize them, or make them better. These are good algorithms, but there are better and Tim's more recent codes are actively used by both MATLAB and Octave. Then, look for missing routines in open source libraries. For example, I just did a quick look and MAGMA currently lists missing routines between them and LAPACK.

Anyway, it's not a field for everyone, but it's one that good architecture and parallelization knowledge can have a positive impact. Nearly all engineering codes depend on good solvers, so the impact is wide.

Though, the classic dense methods have been implemented and optimised and ported to death with LAPACK/BLAS, haven't they.

What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

You'll need to have a pretty good understanding of the language and its paradigmatic use, and of linear algebra, and of modern computer architecture.

However, I assume that you're right in that there might still be some low-hanging fruit.

BTW, Julia is an awesome language with excellent LA support, and it's nice in that most algorithms are coded in Julia itself (unlike the two-language situation in Python, Clojure (AFAIK), etc.)

> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

Honestly, it might be tricky, but implementing matrix operations is not rocket science either. I find it incredible that so many projects rely on NVidia's proprietary libraries for doing this on GPU. Maybe there is some secret juice that I just don't know about, but it seems to me there can't be that many optimisation shortcuts for matrix multiplications and the like that require intimate, secret knowledge of the hardware.

> Honestly, it might be tricky, but implementing matrix operations is not rocket science either.

Actually, it's very hard to implement efficient algebraic matrix-matrix and matrix-vector operations, although naive implementations are very easy to pull off. You're fooling yourself if you believe you can whip out an implementation for basic BLAS-(1|2|3) kernels that matches the performance of properly tuned implementations. Implementing a kernel whose performance doesn't stray too far from the hardware's capacity takes a lot of knowledge and work on low-level details such as cache hierarchies and its impact on the memory access performance. Floating point operations actually take a back-seat to memory access, as they represent a small fraction of the operations being performed by the kernel (IIRC, in sparse matrix operations the proportion of fp operations is only about 1-in-7) and the bulk of the implementation is focused on memory access that minimizes cache misses. Therefore, to even be in a position to implement an acceptable matrix-vector or matrix-matrix kernel you need to have a solid understanding on how particular performances handle memory. This isn't trivial, and it's one of the reasons why articles about implementing X on a GPU, even if X is a classic algorithm, are accepted and published by specialized publications.

The hard part isn’t getting it basically right. The hard part is knowing where subtle bugs are.

I remember my numerical computing textbook had one major takeaway: almost always write your own matrix algorithms; almost never use them.

An extraordinary amount of work has gone into linear algebra computation and much of it for mission critical stuff. It might not be “rocket science” but it’s close.

This is an important point, but I don't think it's insurmountable for a broader group of people.

Though technically not required, I think it helps to know a good deal of mathematical theory as well in order to check properties that should hold, which helps in tracking down subtle bugs. In my experience, tracking these bugs is a royal pain in the ass and writing good tests for these properties is tedious. Even then, there's a lot of funny stuff that goes on. One of my favorites is that it's possible to take the Choleski factorization of an indefinite matrix in some situations. That means it's not safe to use a Choleski to check for whether or a not a matrix is positive definite for mission critical functions:

https://scicomp.stackexchange.com/a/26223

But, anyway, mostly I contend that writing these algorithms is extremely tedious, so there's not that many people with the patience. I also still maintain that there's room for contribution here. Although most often these routines are written in C and Fortran, the parallel tooling for these languages is less sophisticated than that of modern languages. I've been wondering whether a language like Rust would make it easier to code and parallelize some of these algorithms.

> implementing matrix operations is not rocket science either.

Naively implementing them sure, but looking at OpenBLAS/LAPACK/etc code it seems pretty close to rocket science

It's not rocket science by any means, and the 'well behaved' case for matrix multiplication (sufficiently 'squarish' dimensions, i.e. not too similar to a dot product, running on a known system configuration) has been beaten to death, revived, and beaten to death again many, many times and current libraries can get ~90% of theoretical peak.

That being said, there are a lot of nontrivial cases and pitfalls lying around that can cause some grief to someone trying to reinvent the wheel, as well as open theoretical problems that could lead to significant speedups if they're resolved. Here's a short and definitely not complete list of these:

- Communication costs (transferring data between fast memory and slow memory, e.g. cache <-> memory or GPU memory <-> main memory) is the bottleneck in most cases. The asymptotic communication costs (both in terms of lower bounds, and in terms of attaining said lower bounds through a judicious choice of tiling sizes) of many of these problems are well-understood (see [Irony-Toledo '04] for the classic work on this, extended to most of linear algebra by [Ballard et al. '14]), but these asymptotic analyses have lots of hidden constants and parameters that need to be optimized for not just a given architecture but also a given problem size. For instance, suppose you have an inner-product or an input of sufficiently similar dimension (i.e. multiplying matrix of dimension mn by matrix of dimension nk, n >> m, k). If you're going to split this thing up into parallel blocks (e.g. through divide-and-conquer along the 'middle dimension' n, since we assume the other dimensions are small enough that you can't split them up too much) you'll have to allocate nksizeof(double, or whatever your matrix elements are) additional memory for each thread, since each thread needs a place to store its sub-result before they're all summed together. The theory will tell you that replicating until you fill up your memory is "optimal" (i.e. within a constant factor of the theoretical communication lower bound), but that constant can be fairly large (integer-factor in many cases), and optimizing that mostly is a matter of experimentation and intuition. In fact, back in 2013 the latest release of both Intel MKL and PLASMA did not deal with this case particularly well, although I haven't tried more recent versions.

- Floating point addition is not associative and if you're using parallelism, there's no guarantee that two subsequent runs of the same program will give you the same result since the additions can happen in any order. For some cases, most notably debugging (but also times when you need code that meets certain verification or contractual requirements - you can't have code giving different results at different iterations) you want 'reproducibility', which is very nontrivial to implement (see: ReproBLAS project).

- Precision issues: Do you need double-precision numbers? or will a single-precision or something smaller work? Often problems are more sensitive to certain results than others, so you'd want to switch between different precisions at different stages in your problem for additional optimization (see: Rubio Gonzalez et al.'s paper on Precimonious).

- Matrix multiplication (and therefore all the other algorithms that depend on it, e.g. GEPP, TRSM, Cholesky...) has significantly faster algorithms than the classical O(n^3) naive one. Strassen's method (runtime ~O(n^2.8)) is sometimes used in practice, but there are algorithms with significantly better exponents (I believe the world record is currently held by Le Gall at somewhere around 2.37). Unfortunately, these algorithms tend to have massive constant factors render them impractical for real-world use, and as far as I know figuring out if these can actually be applied these in practice is an open problem. Reducing the exponent (or finding lower bounds for it) is also an area of active research of importance to not just the numerical linear algebra community but also to the complexity theory crowd (since a lot of problems can be reduced to matrix multiplication).

- Advances in taking advantage of matrix structure to speed up linear solvers, most notably with Laplacian/strictly diagonally dominant solvers started by Spielman and Teng in '04 and built upon in recent years. As with the faster matrix multiplication algorithms many of these solvers are not yet competitive compared to stuff used in practice (multigrid methods and the like), which don't have the same kinds of guarantees but run quickly in practice. (I don't believe that there's been a whole lot of work in terms of actually implementing and optimizing of Laplacian solvers either). In the theory world it's also interesting to ask how far these efficient methods will go (how large is the class of systems that you can solve quickly?) - Kyng and Zhang have a paper this year, for instance, stating that fast (nearly-linear in the number of nonzeros of the coefficient matrix) solvers for a slight generalization of Laplacian systems (multicommodity Laplacians) would lead to similarly fast linear solvers for all linear equations, so either there are limits on how much you can take advantage of structure or we're going to see some super-fast algorithms for general linear systems.

> Though, the classic dense methods have been implemented and optimised and ported to death with LAPACK/BLAS, haven't they.

Those are generic algorithms that are effective but may be far from being the most efficient, particularly if the problem structure isn't taken in consideration.

> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

Those are by far not the only issues. For example, dense matrices require n^2 elements, which means that in practice 32GB pack at most a 63k*63k dense matrix. This is rather small for some engineering applications. If anyone wishes to go beyond this, they either have to throw money at the problem to continue running naive code on HPC hardware or they work on algorithms. Today's solvers are actually bundles of custom-made algorithms that apply sequences of algorithms to "massage" the problem to improve their tractability, such as applying partial factorizations or reordering schemes, to improve efficiency.

Moreover, there is yet much research work to be done in the field of matrix factorization. For instance, matrix reduction schemes are starting to make their way into practical engineering applications, and among them we are seeing an increase in demand in specialized techniques such as non-negative matrix factorization. There are no "classical methods" for this sort of factorization scheme, and yet it's a very basic technique.

Yes, I believe that dense methods on single core architectures have been well ported. Of course, even there, it wasn't until I believe Eigen that we were able to moderately easily run automatic differentiation over the algebra and factorizations to get their derivatives. Though, technically, yes, we could have run ADIFOR over LAPACK and hated ourselves in the process.

It's been some time since we've seen significant speed increases for single core computers, so as new chips moved to multicore or GPU processing, the field followed. That said, even moderately basic operations such as matrix multiplication weren't fully implemented until maybe three years. I'm sure there were other groups, but that's when NVIDIA released cuBLAS-XT, which allowed multiplication of matrices that didn't fit on the GPU.

So, yes, I agree that this is nontrivial and difficult. At the academic level, numerical linear algebra tends to be in the domain of applied math departments, so I think that sometimes well suited people in CS don't get introduced to the field. That's partly why I enjoy seeing articles on HN that discuss it.

+1 for Julia is an awesome language, but for most LA problems it kicks it to the BLAS libraries! I think they have now fixed the native julia LA... Next week I'm deploying it to do reed-solomon repair (using a custom numerical type).
I think you are wrong about Julia, at least when it comes to linear algebra. Julia also relies on BLAS/LAPACK, at least those libraries that strive to be fast.
you're absolutely right, I shouldn't have said "most". It basically calls BLAS/LAPACK (as every other language, and as one should), and does almost everything beyond that in Julia.
To be fair, almost all of those linalg kernels have pure-Julia implementations (mainly so they work across any numeric type). I think "most" is a reasonable description, with a few special cases that reach out to BLAS and co.
Good point - Julia does LinAlg with BigFloats and other "exotic" numbers, if you wish. But it definitely also integrates LAPACK/BLAS, and Tim Davis' SuiteSparse.
Can you give an example in popular Julia library(ies) that do that, and, very important, to speed comparisons? Are those operations close to the speed found in MKL for example?
> "...Numerical Linear Algebra from Trefethen..."

wow, what a blast from the past! i took intro CS with trefethen, and he was nice enough and obviously quite smart, but oh boy was his class ever boring. probably one of the deciding factors against my doing a CS major (besides not being terribly good at it).

maybe i should take a gander at his book, since (ironically) i used computational linear algebra for my master's and ended up liking it. i took an old version of numerical recipes (https://en.wikipedia.org/wiki/Numerical_Recipes) and translated what i needed into MATLAB.

> so the impact is wide

Yeah - in Shrek (2001), there's a scene where the eponymous ogre takes a mud bath in the swamp. When I studied this linear algebra stuff, one of my (more advanced) fellow students told me that their research group had "done" the mud... :-)

Penn or Stanford? I think that was mostly Nick Foster’s work.
Stanford. I thought it was Ron Fedkiv’s group, but I might easily misremember, or could’ve been a visitor. I mostly remember because it was such a random thing. The graphics people had a lot of fun, one group was working on hair, fur, and fire or so. All boils down to PDEs, discretised into massive linear systems.
Fedkiw coauthored a paper with Foster in 2001, which is about the time of Shrek, so that must have been it!
For programmers, the really interesting part of dense linear algebra is how to achieve high performance, as blocking techniques have to be used to amortize loads from memory to cache.

Google for Goto's "Anatomy of high-performance matrix multiplication", one of my favorite programming texts.

Also the papers underlying the development of the "Elemental" library for distributed dense linear algebra is worth a look.

BTW, I think one important take-away for programmers is:

Don't invert matrices.

For example, the solution for the least squares (=linear regression) problem is

  beta^hat = (X'X)^-1 X'y
and you might think that the best way to compute it is to compute X'X, invert it, do the multiplication, yada yada yada. But it's really better to just ask your library (Matlab, LAPACK, ...) to solve the problem for you, and it'll probably do a QR decomposition. For small problems, it doesn't really matter, but for big problems you'll be both faster and more accurate.
Excellent John D. Cook piece on the same subject: https://www.johndcook.com/blog/2010/01/19/dont-invert-that-m...
That's a slightly different issue: the issue with solving normal equations directly is that the condition number of X^\top X is the square of the condition number of X, so it's much more significant than lu-vs-inv. It can matter a great deal for ill-conditioned linear regression problems where some features resemble each other.
> how to achieve high performance

You really just want to use LAPACK/BLAS, no? (That's what the Neanderthal library mentioned in the article does, btw, and basically linear algebra libraries for other languages, too. If not, you probably shouldn't use it...)

http://neanderthal.uncomplicate.org

Of course you should but the blocking and amortization techniques have much wider usecases beyond linear algebra. Matrix multiplicaton is a useful "simple" example to study; of course you don't write your own if what you want to do is covered by BLAS/LAPACK.

For instance I have written high performance spherical harmonic transforms, and these techniques apply then.

Golub's "Matrix Computations" remains a must-read reference text here: http://web.mit.edu/ehliu/Public/sclark/Golub%20G.H.,%20Van%2...
Do note that Golub & Van Loan is very much a reference text, however; it is not a great choice if you're just learning the subject (it covers everything, but without much depth and without much exposition).
True. J. W. Demmel, Applied Numerical Linear Algebra, is a gentler introduction.

However, if you're masochist enough to actually implement any of the algorithms, Golub & van Loan is a great reference. (Though, you really shouldn't implement it yourself except for didactic purposes - just use LAPACK/BLAS, which has been debugged for decades, and deals with all the special cases you're ignoring (underflow/overflow/nans/zeros/infs/...))

EDIT: Oh, and there's the excellent Numerical Linear Algebra by Trefethen and Bau, as kxyvr mentions.

EDIT EDIT: Funny, on amazon the top reviews for both books mentioned above are identical. Seems like I'm not the only one having trouble keeping them apart... :-) (Demmel is more of an introduction, FWIW)

I'd also recommend Fundamentals of Matrix Computations by Watkins for a great introduction. It helped immensely when I had to teach myself matrix algorithms as part of my undergraduate research.

https://www.amazon.com/Fundamentals-Matrix-Computations-Davi...

If anyone wants an even more introductory book, this one (not yet published, but PDF drafts available) looks nice: https://web.stanford.edu/~boyd/vmls/
Demmel is also, well, Applied. He goes into (some of the) the nitty-gritty implementation details, where Trefethen and Bau stay at a higher level of algorithmic abstraction.
Thanks. Theres a 4th edition from 2013, which I've just bought.
If you want to learn about this stuff but need a more gentler introduction, check out Robert van de Geijn's MOOC at http://www.ulaff.net (if there is currently no session, just download the lecture notes and use those, they have lecture videos embedded).
For a gentler introduction to LU factorization and how useful the decomposition is for efficiently re-solving the same system multiple times, I'd offer up the "Systems of Equations" section of the "Ultimate Electronics" book I've been working on: https://www.circuitlab.com/textbook/systems-of-equations/

I tried to go back and forth between the 5x+2y=3 style equations that most people are familiar with and the matrix forms.

interactive demo of LU-like and QR-like decomposition of affine matrices:

http://frederic-wang.fr/decomposition-of-2d-transform-matric...

i mostly copy-pasted the js code from this page as a contribution to this lib back in the day: https://github.com/epistemex/transformation-matrix-js