by ColonelPhantom
896 days ago
Wait, cuBLAS is slower? Shouldn't it be faster, since it's tuned for the specific hardware? I know that llama.cpp has custom kernels for quantized matrices, which are fast because using cuBLAS would require an extra memory round trip (read -> dequantize -> write, then read -> GEMM -> write, versus a single fused read -> dequantize -> GEMM -> write). But if you're using FP16, the dequantization step shouldn't be necessary. So what makes it faster than cuBLAS?
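
To put rough numbers on the round-trip argument, here is a back-of-the-envelope sketch in Python. It only counts bytes moved to and from global memory, and everything in it is an assumption for illustration: the function name `bytes_moved`, 4-bit weights, FP16 activations and outputs, and no accounting for quantization scales or caching. It is not how llama.cpp or cuBLAS actually measure or schedule anything.

```python
def bytes_moved(n, k, m, weight_bits=4, fused=True):
    """Estimate global-memory traffic for y = W @ x with quantized W.

    W is n x k (quantized), x is k x m and y is n x m (both FP16).
    Hypothetical model: ignores scales, zero points, and cache reuse.
    """
    fp16 = 2  # bytes per FP16 element
    quant_w = n * k * weight_bits // 8  # bytes of quantized weights
    activations = k * m * fp16          # bytes read for x
    output = n * m * fp16               # bytes written for y

    if fused:
        # One pass: read quantized W, dequantize in registers, GEMM, write y.
        return quant_w + activations + output

    # Two passes: dequantize W to FP16 in memory, then run a regular GEMM.
    dequant_pass = quant_w + n * k * fp16            # read quantized, write FP16
    gemm_pass = n * k * fp16 + activations + output  # read FP16 W and x, write y
    return dequant_pass + gemm_pass


if __name__ == "__main__":
    fused = bytes_moved(4096, 4096, 1, fused=True)
    two_pass = bytes_moved(4096, 4096, 1, fused=False)
    print(f"fused:    {fused} bytes")
    print(f"two-pass: {two_pass} bytes")
    print(f"ratio:    {two_pass / fused:.2f}x")
```

Under these assumptions the two-pass path moves roughly nine times as many bytes for a single 4096 x 4096 matrix-vector product, which is the whole case for fused kernels on quantized weights. With FP16 weights there is no dequantization pass to fuse away, which is why the question above still stands.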