Hacker News new | ask | show | jobs
by hansvm 881 days ago
Using just basic large dense square matrix multiplication on a modern CPU as an example, not caring about numerical stability, embarrassingly parallel implementations have high bandwidth to compute ratios, bottlenecking your results 10x - 1000x. Embarrassingly parallelizing cache friendlier solutions is prone to stragglers and other scheduling issues. If you're operating on asymptotically faster solutions like Strassen's it's at least a little nontrivial to keep the few largest recursive layers from dominating the computation and not really being much faster than other parallel algorithms.

As soon as you start mixing numerical stability (data-dependant operation reordering) concerns into linear algebra routines -- arguably one of the main points of LAPACK -- parallelization necessarily gets trickier because you either need a good enough heuristic to ignore those reorderings or you also need data-dependant scheduling without too many pathologies.