by silentvoice 35 days ago

Almost no part of the algorithm as specified in the blog post is GPU friendly, but it can be improved.

The "symbolic" phase is very hard to port to the GPU in the first place and almost always happens on the CPU side. Even on the CPU it is hard (but not impossible) to parallelize. This is generally OK because it is usually cheaper than the numeric phase anyway. But it can be trouble on workstations where the CPU exists primarily as a device for shoveling data into many GPUs: you don't want to stall their kernel pipelines with your symbolic analysis.
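To make the pointer-chasing nature of symbolic analysis concrete, here is a minimal sketch of one piece of it, the elimination tree, using the textbook ancestor-compression approach. This is not the blog post's code: the function name and the CSC inputs (`col_ptr`, `row_idx`) are assumptions for illustration, and only entries above the diagonal of a symmetric matrix are used.

```python
def elimination_tree(n, col_ptr, row_idx):
    """Return parent[] of the elimination tree of a symmetric n x n matrix.

    Assumed input: CSC arrays (col_ptr, row_idx). Only entries with
    row < column are read; everything else is ignored.
    parent[j] == -1 marks a root.
    """
    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed shortcut towards the current root
    for j in range(n):
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            # Walk from i up to the root of its current subtree,
            # re-pointing every visited node at j (path compression).
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j
                if nxt == -1:
                    parent[i] = j  # i had no parent yet, so j becomes it
                i = nxt
    return parent


# 4x4 "arrow" pattern: diagonal plus a dense last column.
arrow_ptr = [0, 1, 2, 3, 7]
arrow_idx = [0, 1, 2, 0, 1, 2, 3]
arrow_parent = elimination_tree(4, arrow_ptr, arrow_idx)

# 4x4 tridiagonal pattern (upper triangle stored).
tri_ptr = [0, 1, 3, 5, 7]
tri_idx = [0, 0, 1, 1, 2, 2, 3]
tri_parent = elimination_tree(4, tri_ptr, tri_idx)
```

Each step of the inner `while` loop depends on what earlier columns wrote into `ancestor`, which is the kind of serial, data-dependent traversal that maps poorly onto a GPU.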

For the numeric phase you need to introduce the concept of either a "front" or a "supernode." This is a technique where multiple columns get batched together and you get a new elimination tree in terms of those batches of columns rather than individual columns. This turns a lot of irregular memory access (gather/scatter style updates) into densely addressed memory patterns, often just calls into level 3 BLAS or LAPACK.
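A rough sketch of how columns might get batched: consecutive columns of the factor L join the same supernode when each column's pattern is the previous column's pattern with its diagonal dropped. The function name and the `col_patterns` representation (one sorted row-index list per column, diagonal included) are assumptions for illustration, and real solvers use more refined rules.

```python
def find_supernodes(col_patterns):
    """Group consecutive columns of L with nested identical patterns.

    col_patterns[j] is assumed to be the sorted list of rows i >= j with
    L[i, j] != 0, including the diagonal j itself.
    Returns half-open column ranges (start, end), one per supernode.
    """
    n = len(col_patterns)
    supernodes = []
    start = 0
    for j in range(1, n):
        # Column j extends the current supernode only if its pattern equals
        # column j-1's pattern below the diagonal.
        if col_patterns[j] != col_patterns[j - 1][1:]:
            supernodes.append((start, j))
            start = j
    if n:
        supernodes.append((start, n))
    return supernodes


patterns = [
    [0, 1, 2, 4],
    [1, 2, 4],
    [2, 4],
    [3, 4],
    [4],
]
groups = find_supernodes(patterns)
```

Within one range, all columns share the same row set below the diagonal block, so the block can be stored as one dense trapezoid and updated with dense kernels instead of per-entry gather/scatter.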

There are some sparsity patterns for which the above supernode/multifrontal approach does not work very well, but for many practical cases like PDE simulations it does. The sparsity patterns which defeat it usually spread nonzeros out all over the matrix rather than containing them within a specific band. In practice, when you materialize a supernode's dense matrix for such a pattern, most of its numeric entries will just be 0s, so you create a lot of extra work that serves no productive purpose.
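One way to put a number on that waste is to ask what fraction of a supernode's dense storage would be explicit zeros if a given range of columns were forced into one block. This is only a sketch: `padding_fraction` and the `col_patterns` layout (sorted row indices per column of L, diagonal included) are hypothetical, and the two patterns below are made up for illustration.

```python
def padding_fraction(col_patterns, start, end):
    """Fraction of explicit zeros if columns [start, end) share one dense block.

    The block's rows are the union of the columns' row patterns; each column
    stores every block row at or below its own diagonal.
    """
    rows = sorted(set().union(*col_patterns[start:end]))
    dense = 0
    for j in range(start, end):
        dense += sum(1 for r in rows if r >= j)
    nnz = sum(len(col_patterns[j]) for j in range(start, end))
    return 1.0 - nnz / dense


# Band-like pattern: the three columns nest perfectly, no padding.
banded = [[0, 1, 2, 4], [1, 2, 4], [2, 4]]
banded_waste = padding_fraction(banded, 0, 3)

# Scattered pattern: each column touches a different far-away row.
scattered = [[0, 5], [1, 7], [2, 9]]
scattered_waste = padding_fraction(scattered, 0, 3)
```

In the banded case every stored entry is a real nonzero; in the scattered case more than half of the dense block is padding that the dense kernels still have to touch.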