by liuliu 897 days ago
One of these days I will find time to write more about model inference optimizations that don't go through distillation / quantization. Case in point: switching llama.cpp from its custom kernel to cuBLAS's GEMM implementation reduces speed from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).
4 comments

Wait, cublas is slower?? Shouldn't it be faster because it's tuned for the specific hardware?

I know that llama.cpp has custom kernels for quantized matrices, which are fast because using cublas would require an extra memory roundtrip (read -> dequantize -> write; read -> gemm -> write, vs. read -> dequant -> gemm -> write). But if you're using FP16 the dequantization step shouldn't be necessary. So how is it faster?
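To put rough numbers on that roundtrip argument, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `weight_bytes_moved` is made up, the weights are taken to be 4-bit, the matrix is a hypothetical 4096 x 4096, and it counts only weight traffic (no scales, activations, or outputs).

```python
def weight_bytes_moved(n_weights, quant_bits=4, fp16_bytes=2):
    """Return (two_pass, fused) bytes of weight traffic for one matmul.

    Hypothetical helper: counts only weight bytes and ignores
    quantization scales, activations, and output writes.
    """
    quant_bytes = n_weights * quant_bits / 8
    dense_bytes = n_weights * fp16_bytes
    # Two passes: read quantized weights, write FP16 weights,
    # then the GEMM reads the FP16 weights back.
    two_pass = quant_bytes + dense_bytes + dense_bytes
    # Fused: read quantized weights once, dequantize on the fly.
    fused = quant_bytes
    return two_pass, fused


# Assumed 4096 x 4096 weight matrix, 4-bit quantization.
two_pass, fused = weight_bytes_moved(4096 * 4096)
print(two_pass / fused)  # 9.0
```

Under those assumptions the two-pass route moves 9x the weight bytes. With FP16 weights there is nothing to dequantize, so both paths read the same 2 bytes per weight and this argument can't account for the gap.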

I'd be interested in this, so you've got at least one reader.
So isn’t that why they have a custom kernel?
I'm interested as well.