| HN Mirror

by giaf 1679 days ago

Thanks for mentioning Octavian, I didn't know about this interesting project. Are you referring to single- or multi-threaded applications?

In embedded optimal control (i.e. the original framework motivating the Prometeo development), applications are typically single-threaded. In that setting, for matrices of size 100x100, MKL is already _very_ close to peak performance, so nothing can be 2x faster without breaking the laws of physics. [Trust that I know what I'm saying here: as the main BLASFEO developer, I check MKL performance often enough ;) ] Just for reference, MKL has special flags, MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ, which enable extra optimizations that improve performance for small matrices (e.g. by turning off most input argument checks). These should definitely be used in a fair comparison.
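The "cannot be 2x faster" argument can be made concrete with a little arithmetic. The following is only a sketch: the helper names `gemm_flops` and `max_speedup` are made up for illustration, and the efficiency figure you feed in has to come from your own measurements, not from this thread.

```c
/* Floating-point operations in C = A*B for n x n matrices:
   n^3 multiplications plus n^3 additions. */
double gemm_flops(int n) {
    return 2.0 * n * n * n;
}

/* Upper bound on the speedup any other implementation can achieve over a
   library that already runs at `efficiency` (0 < efficiency <= 1) of the
   hardware peak. Nothing can run above peak, so the bound is 1/efficiency. */
double max_speedup(double efficiency) {
    return 1.0 / efficiency;
}
```

For example, if a library were measured at a hypothetical 90% of peak, `max_speedup(0.9)` gives a bound of about 1.11x. A 2x speedup is only possible when the reference runs at 50% of peak or less.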

On top of that, linear algebra is much more than matrix-matrix multiplication, and e.g. in embedded optimal control the performance of factorization routines plays a key role.

adgjlsfhk1 1678 days ago

Octavian is absolutely early in its development (currently I think it only supports matmul, including all the transposed versions). https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavia... is the benchmark. It uses automatic threading from both MKL and Octavian (although for these sizes, it will only use a few threads). With only one thread, MKL is much closer: it is only behind by about 20% at n=25 and roughly equal by n=60. I haven't done timings with MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ, but I think using them would be unfair, since Octavian has the same overhead of figuring out how many threads to use.

giaf 1678 days ago

Looking forward to seeing Octavian's development then, it looks exciting! Dealing with triangular matrices and data dependencies in other linear algebra routines such as triangular solves and factorizations will surely be an interesting benchmark for the approach, since such difficulties do not arise in matrix-matrix multiplication. Anyway, that's surely a good starting point for Octavian.
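To illustrate the kind of data dependency meant here, below is a minimal, unoptimized forward substitution for a lower-triangular system. It is a sketch only: the function name `forward_substitution` and the column-major layout with leading dimension `ldl` are choices made for this example, not code from BLASFEO, MKL, or Octavian.

```c
/* Solve L * x = b for x, where L is an n x n lower-triangular matrix stored
   column-major with leading dimension ldl. On entry x holds b; on exit it
   holds the solution. */
void forward_substitution(int n, const double *L, int ldl, double *x) {
    for (int i = 0; i < n; i++) {
        double s = x[i];
        /* x[i] needs x[0..i-1] in their final form, so the outer loop over i
           must run in order. In matrix-matrix multiplication every output
           entry can be computed independently of the others. */
        for (int k = 0; k < i; k++)
            s -= L[i + k * ldl] * x[k];
        x[i] = s / L[i + i * ldl];
    }
}
```

The inner loop also only touches the triangular part of L, so the amount of work per row varies, which is another thing a plain matmul kernel never has to deal with.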

Just one clarification: MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ is not about figuring out how many threads to use, it's about turning off checks on input argument sizes, e.g. if m>lda, or negative lda or m or stuff like that. All these pedantic checks (which comply with the reference BLAS implementation in Netlib) are often not done anyway in experimental linear algebra packages that do not aim at providing a compliant implementation of the standard Fortran BLAS.
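For readers who have not looked at the reference code, the checks in question look roughly like the sketch below, modelled on the argument validation at the top of the Netlib reference DGEMM. The function name `check_dgemm_args` is hypothetical and this is not MKL's actual code. The return value mimics the reference convention of reporting the 1-based position of the first invalid argument in the DGEMM argument list (0 means all arguments are fine).

```c
static int max_int(int a, int b) { return a > b ? a : b; }

static int is_notrans(char t) { return t == 'N' || t == 'n'; }

static int is_trans(char t) {
    return t == 'T' || t == 't' || t == 'C' || t == 'c';
}

/* Validate the arguments of C = alpha*op(A)*op(B) + beta*C, where op(A) is
   m x k, op(B) is k x n, and C is m x n (all column-major). */
int check_dgemm_args(char transa, char transb, int m, int n, int k,
                     int lda, int ldb, int ldc) {
    /* Stored row counts of A and B depend on whether they are transposed. */
    int nrowa = is_notrans(transa) ? m : k;
    int nrowb = is_notrans(transb) ? k : n;

    if (!is_notrans(transa) && !is_trans(transa)) return 1;
    if (!is_notrans(transb) && !is_trans(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    if (lda < max_int(1, nrowa)) return 8;   /* e.g. m > lda */
    if (ldb < max_int(1, nrowb)) return 10;
    if (ldc < max_int(1, m)) return 13;
    return 0;
}
```

Each check is cheap, but for very small matrices the computation itself is so short that a handful of branches plus the extra function call layer becomes measurable.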