


	by stephencanon 1577 days ago
	It’s because tridiagonal multiplication doesn’t do enough work to benefit from most of the BLAS techniques discussed here. You can’t do non-trivial vectorization or cache blocking or threading because there simply isn’t very much work to be done unless your matrix is enormous, so a lot of BLAS implementations will basically just use scalar code for this operation.
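
To make "not very much work" concrete, here is a minimal sketch of a scalar tridiagonal matrix-vector product. It is not taken from any BLAS implementation; the function name `tridiag_matvec` and the three-band storage layout (`lower`, `diag`, `upper`) are assumptions chosen for illustration. Each output element needs at most three multiplies and two adds.

```python
def tridiag_matvec(lower, diag, upper, x):
    """Compute y = T @ x for a tridiagonal T stored as three bands.

    Hypothetical layout (an assumption for this sketch):
      lower: n-1 sub-diagonal entries,   lower[i] = T[i+1][i]
      diag:  n   main-diagonal entries,  diag[i]  = T[i][i]
      upper: n-1 super-diagonal entries, upper[i] = T[i][i+1]
    """
    n = len(diag)
    y = [0.0] * n
    for i in range(n):
        # Main-diagonal contribution.
        acc = diag[i] * x[i]
        # Sub-diagonal contribution (absent in the first row).
        if i > 0:
            acc += lower[i - 1] * x[i - 1]
        # Super-diagonal contribution (absent in the last row).
        if i < n - 1:
            acc += upper[i] * x[i + 1]
        y[i] = acc
    return y
```

The loop touches each input only a handful of times, so there is little data reuse for cache blocking to exploit.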

1 comment

stephencanon 1577 days ago

That said, it _is_ optimizable, it's just that:

- the gains are the sort of typical 2-8x speed improvements from vectorization, not the multiple-orders-of-magnitude gains that you can get on dense GEMM.

- the absolute number of flops performed is O(n^2) rather than O(n^3) for GEMM, so even if you could make tridiagonal operations infinitely fast, that optimization effort would be better spent on even small speedups to the O(n^3) work that probably comprises other parts of your algorithm.
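
A rough sketch of the flop-count gap in the second point, assuming the O(n^2) figure refers to multiplying an n-by-n tridiagonal matrix by a dense n-by-n matrix (that reading, and both function names, are assumptions for illustration):

```python
def tridiag_times_dense_flops(n):
    """Flops for (n x n tridiagonal) @ (n x n dense).

    Per output column: 3n - 2 multiplies (one per nonzero of T)
    and 2n - 2 adds. There are n columns.
    """
    return n * ((3 * n - 2) + (2 * n - 2))


def gemm_flops(n):
    """Flops for a textbook (n x n dense) @ (n x n dense).

    Each of the n * n outputs needs n multiplies and n - 1 adds.
    """
    return n * n * (2 * n - 1)


def flop_ratio(n):
    """How many times more flops dense GEMM performs than the tridiagonal product."""
    return gemm_flops(n) / tridiag_times_dense_flops(n)
```

Under these assumptions the ratio grows roughly as 2n/5, so for large n even a modest percentage gain on the dense part can outweigh eliminating the tridiagonal cost entirely.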
