	by ColonelPhantom 896 days ago
	Wait, cuBLAS is slower? Shouldn't it be faster, since it's tuned for the specific hardware? I know that llama.cpp has custom kernels for quantized matrices, which are fast because going through cuBLAS would require an extra memory round trip (read -> dequantize -> write, then read -> GEMM -> write, versus a single fused read -> dequantize -> GEMM -> write). But if you're using FP16, the dequantization step shouldn't be necessary. So how is the non-cuBLAS path faster in that case?
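
	To make the round-trip argument concrete, here is a rough back-of-the-envelope sketch in Python. It only counts bytes moved to and from global memory for a matrix-vector product. The function names, the 4-bit weight width, and the choice to ignore quantization scales and caching are all assumptions for illustration, not how llama.cpp or cuBLAS actually account for anything.

```python
# Hypothetical traffic model: y = W @ x with W of shape (n, k).
# Assumptions: 4-bit quantized weights, FP16 activations, and no
# per-block scales or cache effects (both ignored for simplicity).

def bytes_dequant_then_gemm(n, k, bits=4, fp_bytes=2):
    """Two passes: dequantize to FP16 in memory, then run a separate GEMM."""
    quant = n * k * bits // 8          # quantized weights
    dense = n * k * fp_bytes           # FP16 copy of the weights
    dequant_pass = quant + dense       # read quantized, write FP16
    gemm_pass = dense + k * fp_bytes + n * fp_bytes  # read FP16 W and x, write y
    return dequant_pass + gemm_pass

def bytes_fused(n, k, bits=4, fp_bytes=2):
    """One pass: dequantize in registers while accumulating the product."""
    quant = n * k * bits // 8
    return quant + k * fp_bytes + n * fp_bytes       # read quantized W and x, write y

def bytes_fp16_gemm(n, k, fp_bytes=2):
    """Plain FP16 weights: no dequantization step at all."""
    return n * k * fp_bytes + k * fp_bytes + n * fp_bytes

if __name__ == "__main__":
    n = k = 4096
    print("dequant + GEMM:", bytes_dequant_then_gemm(n, k))
    print("fused kernel:  ", bytes_fused(n, k))
    print("FP16 GEMM:     ", bytes_fp16_gemm(n, k))
```

	Under these assumptions the two-pass path moves several times more data than the fused one, which is the advantage for quantized weights. For plain FP16 there is no dequantization pass to save, so this model by itself doesn't explain a gap there.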