by stephencanon
1577 days ago
It’s because tridiagonal multiplication doesn’t do enough work to benefit from most of the BLAS techniques discussed here. You can’t do non-trivial vectorization or cache blocking or threading because there simply isn’t very much work to be done unless your matrix is enormous, so a lot of BLAS implementations will basically just use scalar code for this operation.
- even where vectorization applies, the gains are the typical 2-8x speed improvements it usually gives, not the multiple-orders-of-magnitude gains that you can get on dense GEMM.
- the absolute number of flops performed is O(n^2) rather than O(n^3) for GEMM, so even if you could make tridiagonal operations infinitely fast, that optimization effort would be better spent on even small speedups to the O(n^3) work that probably comprises other parts of your algorithm.
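To make the flop-count point concrete, here is a minimal sketch in plain Python. It assumes the O(n^2) figure refers to a tridiagonal matrix times a dense n x n matrix; that reading is an assumption, as are the names `tridiag_matmul` and `dense_gemm_flops`, which are hypothetical helpers and not any BLAS or LAPACK API. The loop is deliberately scalar, roughly the shape of code the comment says many implementations fall back to.

```python
def tridiag_matmul(lower, diag, upper, B):
    """Return (C, flops) with C = T @ B, where T is tridiagonal.

    lower: n-1 subdiagonal entries, diag: n diagonal entries,
    upper: n-1 superdiagonal entries, B: dense n x m matrix (list of rows).
    Hypothetical helper for illustration only.
    """
    n, m = len(diag), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(m):
            # Each output entry touches at most three entries of column j.
            acc = diag[i] * B[i][j]
            flops += 1
            if i > 0:
                acc += lower[i - 1] * B[i - 1][j]
                flops += 2
            if i + 1 < n:
                acc += upper[i] * B[i + 1][j]
                flops += 2
            C[i][j] = acc
    return C, flops


def dense_gemm_flops(n):
    # Textbook count for an n x n by n x n product: one multiply and
    # one add per inner-loop step.
    return 2 * n ** 3


if __name__ == "__main__":
    n = 100
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    _, flops = tridiag_matmul([1.0] * (n - 1), [2.0] * n, [1.0] * (n - 1), identity)
    print("tridiagonal flops:", flops)
    print("dense GEMM flops: ", dense_gemm_flops(n))
```

Under this counting the tridiagonal product costs about 5 flops per output entry, so the gap to dense GEMM grows linearly with n. That is why a constant-factor speedup on the tridiagonal part rarely moves the total.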