Hacker News new | ask | show | jobs
by FabHK 3187 days ago
Though, the classic dense methods have been implemented and optimised and ported to death with LAPACK/BLAS, haven't they.

What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

You'll need to have a pretty good understanding of the language and its paradigmatic use, and of linear algebra, and of modern computer architecture.

However, I assume that you're right in that there might still be some low-hanging fruit.

BTW, Julia is an awesome language with excellent LA support, and it's nice in that most algorithms are coded in Julia itself (unlike the two-language situation in Python, Clojure (AFAIK), etc.)

5 comments

> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

Honestly, it might be tricky, but implementing matrix operations is not rocket science either. I find it incredible that so many projects rely on NVidia's proprietary libraries for doing this on GPU. Maybe there is some secret juice that I just don't know about, but it seems to me there can't be that many optimisation shortcuts for matrix multiplications and the like that require intimate, secret knowledge of the hardware.

> Honestly, it might be tricky, but implementing matrix operations is not rocket science either.

Actually, it's very hard to implement efficient algebraic matrix-matrix and matrix-vector operations, although naive implementations are very easy to pull off. You're fooling yourself if you believe you can whip out an implementation for basic BLAS-(1|2|3) kernels that matches the performance of properly tuned implementations. Implementing a kernel whose performance doesn't stray too far from the hardware's capacity takes a lot of knowledge and work on low-level details such as cache hierarchies and its impact on the memory access performance. Floating point operations actually take a back-seat to memory access, as they represent a small fraction of the operations being performed by the kernel (IIRC, in sparse matrix operations the proportion of fp operations is only about 1-in-7) and the bulk of the implementation is focused on memory access that minimizes cache misses. Therefore, to even be in a position to implement an acceptable matrix-vector or matrix-matrix kernel you need to have a solid understanding on how particular performances handle memory. This isn't trivial, and it's one of the reasons why articles about implementing X on a GPU, even if X is a classic algorithm, are accepted and published by specialized publications.

The hard part isn’t getting it basically right. The hard part is knowing where subtle bugs are.

I remember my numerical computing textbook had one major takeaway: almost always write your own matrix algorithms; almost never use them.

An extraordinary amount of work has gone into linear algebra computation and much of it for mission critical stuff. It might not be “rocket science” but it’s close.

This is an important point, but I don't think it's insurmountable for a broader group of people.

Though technically not required, I think it helps to know a good deal of mathematical theory as well in order to check properties that should hold, which helps in tracking down subtle bugs. In my experience, tracking these bugs is a royal pain in the ass and writing good tests for these properties is tedious. Even then, there's a lot of funny stuff that goes on. One of my favorites is that it's possible to take the Choleski factorization of an indefinite matrix in some situations. That means it's not safe to use a Choleski to check for whether or a not a matrix is positive definite for mission critical functions:

https://scicomp.stackexchange.com/a/26223

But, anyway, mostly I contend that writing these algorithms is extremely tedious, so there's not that many people with the patience. I also still maintain that there's room for contribution here. Although most often these routines are written in C and Fortran, the parallel tooling for these languages is less sophisticated than that of modern languages. I've been wondering whether a language like Rust would make it easier to code and parallelize some of these algorithms.

> implementing matrix operations is not rocket science either.

Naively implementing them sure, but looking at OpenBLAS/LAPACK/etc code it seems pretty close to rocket science

It's not rocket science by any means, and the 'well behaved' case for matrix multiplication (sufficiently 'squarish' dimensions, i.e. not too similar to a dot product, running on a known system configuration) has been beaten to death, revived, and beaten to death again many, many times and current libraries can get ~90% of theoretical peak.

That being said, there are a lot of nontrivial cases and pitfalls lying around that can cause some grief to someone trying to reinvent the wheel, as well as open theoretical problems that could lead to significant speedups if they're resolved. Here's a short and definitely not complete list of these:

- Communication costs (transferring data between fast memory and slow memory, e.g. cache <-> memory or GPU memory <-> main memory) is the bottleneck in most cases. The asymptotic communication costs (both in terms of lower bounds, and in terms of attaining said lower bounds through a judicious choice of tiling sizes) of many of these problems are well-understood (see [Irony-Toledo '04] for the classic work on this, extended to most of linear algebra by [Ballard et al. '14]), but these asymptotic analyses have lots of hidden constants and parameters that need to be optimized for not just a given architecture but also a given problem size. For instance, suppose you have an inner-product or an input of sufficiently similar dimension (i.e. multiplying matrix of dimension mn by matrix of dimension nk, n >> m, k). If you're going to split this thing up into parallel blocks (e.g. through divide-and-conquer along the 'middle dimension' n, since we assume the other dimensions are small enough that you can't split them up too much) you'll have to allocate nksizeof(double, or whatever your matrix elements are) additional memory for each thread, since each thread needs a place to store its sub-result before they're all summed together. The theory will tell you that replicating until you fill up your memory is "optimal" (i.e. within a constant factor of the theoretical communication lower bound), but that constant can be fairly large (integer-factor in many cases), and optimizing that mostly is a matter of experimentation and intuition. In fact, back in 2013 the latest release of both Intel MKL and PLASMA did not deal with this case particularly well, although I haven't tried more recent versions.

- Floating point addition is not associative and if you're using parallelism, there's no guarantee that two subsequent runs of the same program will give you the same result since the additions can happen in any order. For some cases, most notably debugging (but also times when you need code that meets certain verification or contractual requirements - you can't have code giving different results at different iterations) you want 'reproducibility', which is very nontrivial to implement (see: ReproBLAS project).

- Precision issues: Do you need double-precision numbers? or will a single-precision or something smaller work? Often problems are more sensitive to certain results than others, so you'd want to switch between different precisions at different stages in your problem for additional optimization (see: Rubio Gonzalez et al.'s paper on Precimonious).

- Matrix multiplication (and therefore all the other algorithms that depend on it, e.g. GEPP, TRSM, Cholesky...) has significantly faster algorithms than the classical O(n^3) naive one. Strassen's method (runtime ~O(n^2.8)) is sometimes used in practice, but there are algorithms with significantly better exponents (I believe the world record is currently held by Le Gall at somewhere around 2.37). Unfortunately, these algorithms tend to have massive constant factors render them impractical for real-world use, and as far as I know figuring out if these can actually be applied these in practice is an open problem. Reducing the exponent (or finding lower bounds for it) is also an area of active research of importance to not just the numerical linear algebra community but also to the complexity theory crowd (since a lot of problems can be reduced to matrix multiplication).

- Advances in taking advantage of matrix structure to speed up linear solvers, most notably with Laplacian/strictly diagonally dominant solvers started by Spielman and Teng in '04 and built upon in recent years. As with the faster matrix multiplication algorithms many of these solvers are not yet competitive compared to stuff used in practice (multigrid methods and the like), which don't have the same kinds of guarantees but run quickly in practice. (I don't believe that there's been a whole lot of work in terms of actually implementing and optimizing of Laplacian solvers either). In the theory world it's also interesting to ask how far these efficient methods will go (how large is the class of systems that you can solve quickly?) - Kyng and Zhang have a paper this year, for instance, stating that fast (nearly-linear in the number of nonzeros of the coefficient matrix) solvers for a slight generalization of Laplacian systems (multicommodity Laplacians) would lead to similarly fast linear solvers for all linear equations, so either there are limits on how much you can take advantage of structure or we're going to see some super-fast algorithms for general linear systems.

> Though, the classic dense methods have been implemented and optimised and ported to death with LAPACK/BLAS, haven't they.

Those are generic algorithms that are effective but may be far from being the most efficient, particularly if the problem structure isn't taken in consideration.

> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port.

Those are by far not the only issues. For example, dense matrices require n^2 elements, which means that in practice 32GB pack at most a 63k*63k dense matrix. This is rather small for some engineering applications. If anyone wishes to go beyond this, they either have to throw money at the problem to continue running naive code on HPC hardware or they work on algorithms. Today's solvers are actually bundles of custom-made algorithms that apply sequences of algorithms to "massage" the problem to improve their tractability, such as applying partial factorizations or reordering schemes, to improve efficiency.

Moreover, there is yet much research work to be done in the field of matrix factorization. For instance, matrix reduction schemes are starting to make their way into practical engineering applications, and among them we are seeing an increase in demand in specialized techniques such as non-negative matrix factorization. There are no "classical methods" for this sort of factorization scheme, and yet it's a very basic technique.

Yes, I believe that dense methods on single core architectures have been well ported. Of course, even there, it wasn't until I believe Eigen that we were able to moderately easily run automatic differentiation over the algebra and factorizations to get their derivatives. Though, technically, yes, we could have run ADIFOR over LAPACK and hated ourselves in the process.

It's been some time since we've seen significant speed increases for single core computers, so as new chips moved to multicore or GPU processing, the field followed. That said, even moderately basic operations such as matrix multiplication weren't fully implemented until maybe three years. I'm sure there were other groups, but that's when NVIDIA released cuBLAS-XT, which allowed multiplication of matrices that didn't fit on the GPU.

So, yes, I agree that this is nontrivial and difficult. At the academic level, numerical linear algebra tends to be in the domain of applied math departments, so I think that sometimes well suited people in CS don't get introduced to the field. That's partly why I enjoy seeing articles on HN that discuss it.

+1 for Julia is an awesome language, but for most LA problems it kicks it to the BLAS libraries! I think they have now fixed the native julia LA... Next week I'm deploying it to do reed-solomon repair (using a custom numerical type).
I think you are wrong about Julia, at least when it comes to linear algebra. Julia also relies on BLAS/LAPACK, at least those libraries that strive to be fast.
you're absolutely right, I shouldn't have said "most". It basically calls BLAS/LAPACK (as every other language, and as one should), and does almost everything beyond that in Julia.
To be fair, almost all of those linalg kernels have pure-Julia implementations (mainly so they work across any numeric type). I think "most" is a reasonable description, with a few special cases that reach out to BLAS and co.
Good point - Julia does LinAlg with BigFloats and other "exotic" numbers, if you wish. But it definitely also integrates LAPACK/BLAS, and Tim Davis' SuiteSparse.
Can you give an example in popular Julia library(ies) that do that, and, very important, to speed comparisons? Are those operations close to the speed found in MKL for example?
Well if you define a new type, and * and + operations for it, you will get matrix multiplication for free in native julia. I think you can do similar things for a few other algorithms in native julia. I don't think it will be MKL speed though because MKL uses cache size information etc. and is very optimized. Of course julia uses BLAS/Lapack etc for the types it can.

GenericSVD.jl is an interesting example of how you can factorize a matrix of quaternions by just defining appropriate functions for quarternions and running using the generic SVD algorithm (it's just an example of what I talked about above, the library is not really used by anyone I think).

Knet.jl is a fairly mature deep learning library that really demonstrates the power of the language. You simply define your forward and loss functions, and then you are set. The function gets auto diferentiated and you go ahead and do SGD. There really isn't anything else you need to mess with. It also works seamlessly with GPU arrays without you having to touch the forward and loss functions.

When I hear "you just do X", I become sceptical of that instantly :)