Hacker News new | ask | show | jobs
by Schneekatze 4795 days ago
Hi, shark developer here.

First of all, we are glad that our library is discussed on this board! We are happy for every feedback we can get!

Regarding Performance: we try to get the key algorithms as fast as possible. And for the hardest parts we rely not on ublas, but use optional bindings to ATLAS. Speed was one of the key design criteria. We hope that we achieved that. Clearly this is no guarantee that every algorithm is fast, but in this case: just add a ticket!

Please bear in mind, that Shark is still in beta stage, and we are heavily developing it (I am right now working on the family of multi class SVMs). So for example parallelism using OpenMP is not fully integrated.

3 comments

You should consider using Eigen for linear algebra, I have personally found it much better performance wise than using bindings to ATLAS or other more standard linear algebra solutions. ML algorithms tend to be multistage (think about the update of weights with momentum in a Neural network for example), and the primitives available in ATLAS or a blas library are really too low level. Eigen since it generates code for a whole complicated expression can blow a standard linear algebra library out of the water for a certain class of problems. For others obviously the highly tuned vendor BLAS code would win, but I've seen huge speedups by using Eigen it fits well for complex ML operations.
This is true, even though we would rather switch to Armadillo due to it's easier handling and better high-level behaviour.

Right now the linear algebra library we use -ublas- has the same behaviour as Eigen for BLAS1 type expressions. So it tries to generate optimal (non-SSE) code. Only for BLAS2 and 3 we fall back to the ATLAS-routines which has the same performance as Eigen on the interesting problem sizes.

//small edit In the end it is not so interesting whether the BLAS1-type expressions are fast as they make up < 1% of run time performance. The big chunks are the data processing inside the matrix-matrix multiplications of the Neural Networks and similar entities.

You forget that if you can do the whole weight update in a single shot operation that the data doesn't have to go through the cache multiple times, and at least on the problem sizes I am working on FP bandwidth isn't what kills it, but the memory bandwidth. Back to the NN example: If you can do the matrix multiply and the application of the delta weights in a single loop iteration you get much better cache behavior.

Another thing about code generation, I am also using a hacked version of Eigen as well in a project I'm working on that can do the tanh and derivative of the tanh so the NN activations go quite abit faster since you can generate vectorized code for the whole calculation that will visit the memory location exactly once. While true the calculation of the weight updates is the most time spent, I saw 3-4x speedup in the activation code doing it in a single operation due to better memory access patterns and less loop iterations. Better memory access patterns can also have synergistic effects on other code because there is less cache pollution happening. By being fast and loose and introducing a few other copies of the matrix data in my case, my performance falls off a cliff when it no longer fits in the cpu cache nicely. 10x difference in the particular case I am remembering.

As always performance is part art, part science and perhaps it won't matter as much for the general case, but for my specific implementation and my matrix sizes Eigen has made a measurable difference for me compared to other solutions.

FYI, another bonus on Eigen is if you are using a Matrix-Matrix operation you will get multithreading with OpenMP for free. It isn't the most efficient way (you can do better threading yourself probably), but it gets most of the way there for stuff that will use GEMM operations.
Amazing work, thanks for sharing. Any hint on how it behaves on big datasets ? Over time I've found that scaling up to real world (and industry) datasets requires going custom (or distributed, a-la-mr / hadoop).
Unfortunately, we don't have much experience with industry sized datasets - simply because we don't have them. I know that our SVMs are among the fastest of the world. At least we beat Libsvm and Liblinear(and there certainly on very big datasets!). But we lack support for hadoop/mpi, even though we would like to change that in the future.

Right now I would say that the main focus of shark is research oriented. That is we want to be fast but also modular so that we can still easily exchange different aspects of the algorithms with our own work. As these goals sometime clash, it is hard to claim that we are the fastest, simply because there is for nearly every algorithm some way to improve when you know exactly which combination of model, loss function and training algorithm you use. But we are (hopefully) reasonably fast and certainly want to improve.