HN Mirror



	by liuliu 897 days ago
	One of these days I will find time to write more about model inference optimizations that don't go through distillation / quantization. Case in point: switching llama.cpp from its custom kernel to cuBLAS's GEMM implementation reduces speed from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).

4 comments

ColonelPhantom 896 days ago

Wait, cuBLAS is slower?? Shouldn't it be faster, since it's tuned for the specific hardware?

I know that llama.cpp has custom kernels for quantized matrices, which are fast because using cuBLAS would require an extra memory roundtrip (read -> dequantize -> write; read -> gemm -> write, vs. read -> dequantize -> gemm -> write). But with FP16 the dequantization step shouldn't be necessary. So how is the custom kernel faster?
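
To put rough numbers on the roundtrip argument, here is a minimal sketch that only counts weight bytes moved through GPU memory for one matrix. The 4096x4096 shape and 4-bit weights are assumptions picked for illustration, the function names are hypothetical, and it ignores activations, outputs, and quantization scales, so treat it as back-of-the-envelope arithmetic and not a measurement of llama.cpp or cuBLAS.

```python
# Back-of-the-envelope weight traffic for one matrix multiply.
# Shapes and bit widths are illustrative assumptions, not llama.cpp internals.

def unfused_bytes(rows, cols, quant_bits, fp16_size=2):
    """Separate dequantize kernel, then a library GEMM on the FP16 copy."""
    n = rows * cols
    read_quant = n * quant_bits // 8   # read -> dequantize
    write_fp16 = n * fp16_size         # -> write the FP16 copy
    read_fp16 = n * fp16_size          # read -> gemm
    return read_quant + write_fp16 + read_fp16

def fused_bytes(rows, cols, quant_bits):
    """Custom kernel: dequantize in registers right before the multiply."""
    return rows * cols * quant_bits // 8

def fp16_weight_bytes(rows, cols, fp16_size=2):
    """FP16 weights: nothing to dequantize, so one read in either design."""
    return rows * cols * fp16_size

rows, cols = 4096, 4096  # assumed layer shape
print(unfused_bytes(rows, cols, 4))   # 75497472
print(fused_bytes(rows, cols, 4))     # 8388608
print(fp16_weight_bytes(rows, cols))  # 33554432
```

Under these assumptions the unfused path moves 9x the weight bytes of the fused one, while for FP16 both designs read the weights exactly once, which is why the fusion argument alone doesn't explain the FP16 gap.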


golly_ned 896 days ago

I'd be interested in this, so you've got at least one reader.


sroussey 896 days ago

So isn’t that why they have a custom kernel?


thelastparadise 896 days ago

I'm interested as well.
