by radarsat1
3187 days ago
> What you're talking about is parallelisation, moving to GPU, and modern (combinatorial) methods for sparse systems, and that's fairly cutting edge, and not trivial to implement/port. Honestly, it might be tricky, but implementing matrix operations is not rocket science either. I find it incredible that so many projects rely on NVidia's proprietary libraries for doing this on GPU. Maybe there is some secret juice that I just don't know about, but it seems to me there can't be that many optimisation shortcuts for matrix multiplications and the like that require intimate, secret knowledge of the hardware.

Actually, it's very hard to implement efficient algebraic matrix-matrix and matrix-vector operations, although naive implementations are very easy to pull off. You're fooling yourself if you believe you can whip up an implementation of the basic BLAS-(1|2|3) kernels that matches the performance of properly tuned implementations. Implementing a kernel whose performance doesn't stray too far from the hardware's capacity takes a lot of knowledge and work on low-level details, such as cache hierarchies and their impact on memory access performance. Floating point operations actually take a back seat to memory access: they represent a small fraction of the operations performed by the kernel (IIRC, in sparse matrix operations the proportion of fp operations is only about 1-in-7), and the bulk of the implementation effort goes into arranging memory accesses so that cache misses are minimized. Therefore, to even be in a position to implement an acceptable matrix-vector or matrix-matrix kernel, you need a solid understanding of how particular architectures handle memory. This isn't trivial, and it's one of the reasons why articles about implementing X on a GPU, even if X is a classic algorithm, are accepted and published by specialized publications.
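To make the "naive is easy, cache-aware is not" point concrete, here is a rough sketch of the same matrix product written two ways. The function names and the `block` parameter are mine, purely for illustration, and Python is used only to show the loop structure: any real speedup from tiling shows up in a compiled language with contiguous arrays, and the right tile size depends on the cache sizes of the target hardware, which is exactly the tuning work tuned BLAS libraries do for you.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                # b[k][j] walks down a column of B; in a row-major array
                # that is a strided access, which is unfriendly to caches.
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=32):
    """Same arithmetic, reordered into tiles of at most block x block.

    `block` is a hypothetical tuning knob; the idea is that the tiles
    being worked on stay in cache while they are reused.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        # Innermost loop walks along rows of B and C only.
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_naive(a, b))             # [[58, 64], [139, 154]]
print(matmul_blocked(a, b, block=2))  # same result, different access order
```

Both versions perform exactly the same multiplications and additions; the only difference is the order in which memory is touched. That reordering is the easy part. Picking tile sizes per cache level, using registers and vector units well, and handling edge cases without losing throughput is where the real effort goes.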