Hacker News new | ask | show | jobs
by moffkalast 806 days ago
Correct me if I'm wrong, but in the tests I've run, the matmul optimizations only have an effect if there's no other blas acceleration. If one can at least offload the KV cache to cublas or run with openblas it's not really used, right? At least I didn't see any speedup in with that config when comparing that PR to the main llama.cpp branch.
1 comments

The code that launches my code (see ggml_compute_forward_mul_mat) comes after CLBLAST, Accelerate, and OpenBLAS. The latter take precedence. So if you're not seeing any speedup in enabling them, it's probably because tinyBLAS has reached terms of equality with the BLAS. It's obviously nowhere near as fast as cuBLAS, but maybe PCIE memory transfer overhead explains it. It also really depends on various other factors, like quantization type. For example, the BLAS doesn't support formats like Q4_0 and tinyBLAS does.