Hacker News new | ask | show | jobs
by costrouc 3067 days ago
For those wondering the lapack guys are working on a new version of lapack to run on heterogeneous architecture including gpus. See their work at http://www.icl.utk.edu/research

I have worked with these guys and all I can say is good luck outperforming the highly optimized routines they have written. My bet is that the the guy writing this blog used a non optimized version of lapack.

1 comments

There's no mystery here, author is only solving a 5x5 system. LAPACK will have barely finished checking function call parameters by the time a fully specialized and inlined solver is finished. Also, from a cursory inspection, author does no pivoting [edit: worse, it just uses Cramer's rule, which is insane], so their solver will be much less numerically robust (but this is admittedly less of a problem for the 5x5 case).

The performance is completely expected--in fact, I would expect an optimized 5x5 solver to be faster than this.

A better comparison might be a C++ library like Eigen that is intended for use on these small systems, that should optimize out function calls in a similar way, and that uses intrinsics.
If your matrices are of arbitrary size, Blaze [0] would be my pick. It has the best performance on their benchmarks, Baptiste Wicht's benchmarks, and my own.

Though if the matrices are guaranteed to be small (like in this example), libxsmm [1] is specialized and highly optimized for this use case and beats its competitors above.

And yes, it's absolutely essential to make sure you didn't just move computations to compile-time.

[0]: https://bitbucket.org/blaze-lib/blaze

[1]: https://github.com/hfp/libxsmm

Cramer's rule is unstable even for 2x2 problems
I think gp is missing a “less” before “numerically robust”
Yeah, thanks. Fixed.