by liuliu
897 days ago
One of these days I will find time to write more about model inference optimizations that don't go through distillation / quantization. Case in point: switching llama.cpp from its custom kernel to cuBLAS's GEMM implementation reduces speed from 70 tok/s to 49 tok/s (RTX 6000 Ada, Mistral-7B, FP16).

I know that llama.cpp has custom kernels for quantized matrices, which are fast because using cuBLAS would require an extra memory roundtrip (read -> dequantize -> write; read -> GEMM -> write, vs. read -> dequantize -> GEMM -> write). But with FP16 weights there is no dequantization step to fuse. So how is the custom kernel still faster than cuBLAS?
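
To put rough numbers on the roundtrip argument, here is a back-of-the-envelope sketch that counts only the bytes of weight traffic for the two pipelines. Everything in it is an assumption for illustration: the function name is made up, the 18 bytes per 32 weights figure assumes a 4-bit block format with one FP16 scale per block, the 4096 x 4096 matrix is an arbitrary size, and activation and output traffic are ignored (reasonable only for batch-1 decoding).

```python
def weight_traffic_bytes(n_weights, quant_bytes_per_weight, fp16_bytes=2):
    """Estimate bytes of weight traffic for unfused vs. fused dequantize + GEMM.

    Hypothetical helper, not llama.cpp or cuBLAS code. Ignores activations,
    outputs, and caching effects.
    """
    # Unfused: kernel 1 reads quantized weights and writes FP16 weights,
    # then the GEMM kernel reads those FP16 weights back.
    unfused = n_weights * (quant_bytes_per_weight + fp16_bytes + fp16_bytes)
    # Fused: quantized weights are read once and dequantized on the fly.
    fused = n_weights * quant_bytes_per_weight
    return unfused, fused


# Assumed example: one 4096 x 4096 weight matrix, 18 bytes per 32 weights.
n = 4096 * 4096
unfused, fused = weight_traffic_bytes(n, 18 / 32)
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, "
      f"ratio: {unfused / fused:.1f}x")
```

Under these assumptions the unfused path moves several times more weight bytes than the fused one. For FP16 weights the same accounting gives 2 bytes per weight for both a custom kernel and a cuBLAS GEMM, so memory traffic alone does not account for the 70 vs. 49 tok/s gap.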