Hacker News new | ask | show | jobs
by trombonechamp 1577 days ago
Once I needed a single LAPACK function (tridiagonal matrix multiplication) in a C Python extension. I spent an eternity trying to link to LAPACK reliably across platforms until giving up and deciding to implement it myself.

My implementation ended up being about 10 lines of code and ran slightly faster (!!!) than the LAPACK function. (I think this is because LAPACK provided numerical stability for special cases which didn't apply to my use case.)

1 comments

It’s because tridiagonal multiplication doesn’t do enough work to benefit from most of the BLAS techniques discussed here. You can’t do non-trivial vectorization or cache blocking or threading because there simply isn’t very much work to be done unless your matrix is enormous, so a lot of BLAS implementations will basically just use scalar code for this operation.
That said, it _is_ optimizable, it's just that:

- the gains are the sort of typical 2-8x speed improvements from vectorization, not the multiple-orders-of-magnitude gains that you can get on dense GEMM.

- the absolute number of flops performed is O(n^2) rather than O(n^3) for GEMM, so even if you could make tridiagonal operations infinitely fast, that optimization effort would be better spent on even small speedups to the O(n^3) work that probably comprises other parts of your algorithm.