by amanrs
1059 days ago
The key misconception about many quantization methods is that lower precision = better speed. From skimming the AWQ paper (https://arxiv.org/pdf/2306.00978.pdf), I believe GPTQ is not much faster than bf16 overall. It's 3x faster at a batch size of 1, but that's still over 10x more expensive than gpt-3.5. At larger batch sizes, bf16 costs dip below 3-bit quantized.
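To make the batch-size point concrete, here is a rough roofline-style sketch. It is a toy model, not a benchmark: the model size, memory bandwidth, peak FLOPS, and the 3x kernel overhead for 3-bit weights are all assumed round numbers, and it ignores the KV cache and activations entirely. The idea is that at small batch sizes a decode step is bound by reading the weights (so fewer bits helps), while at large batch sizes it is bound by compute (so any dequantization overhead hurts).

```python
def decode_step_time(params, bits, batch, mem_bw, peak_flops, kernel_overhead=1.0):
    """Rough time in seconds for one decode step (one new token per sequence).

    Toy roofline model: the step takes as long as the slower of
    (a) streaming all weights from memory once, and
    (b) doing ~2 FLOPs per parameter per sequence in the batch.
    """
    weight_bytes = params * bits / 8
    memory_time = weight_bytes / mem_bw  # independent of batch size
    compute_time = kernel_overhead * 2 * params * batch / peak_flops
    return max(memory_time, compute_time)


# All values below are assumptions chosen for illustration only.
PARAMS = 7e9        # hypothetical 7B-parameter model
MEM_BW = 1e12       # assumed 1 TB/s memory bandwidth
PEAK_FLOPS = 1e14   # assumed 100 TFLOP/s of usable compute
Q3_OVERHEAD = 3.0   # assumed extra compute cost of a 3-bit kernel (dequantization)


def speedup_of_3bit(batch):
    """Ratio of bf16 step time to 3-bit step time; above 1 means 3-bit is faster."""
    bf16 = decode_step_time(PARAMS, 16, batch, MEM_BW, PEAK_FLOPS)
    q3 = decode_step_time(PARAMS, 3, batch, MEM_BW, PEAK_FLOPS, Q3_OVERHEAD)
    return bf16 / q3


if __name__ == "__main__":
    for batch in (1, 8, 32, 64, 128):
        print(f"batch={batch:4d}  3-bit speedup over bf16: {speedup_of_3bit(batch):.2f}x")
```

Under these assumed numbers the 3-bit model wins clearly at batch size 1 and loses to bf16 somewhere past batch size 32. The exact crossover depends entirely on the hardware and the kernel, so treat the shape of the curve as the takeaway, not the values.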
And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
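For the "smaller single GPUs" point, a back-of-the-envelope weight-memory calculation shows the effect. This is only a sketch: the GPU sizes and the 80% headroom factor are assumptions, and it counts weights only (no KV cache, activations, or quantization metadata such as scales and zero points).

```python
def weight_gib(params, bits):
    """Memory needed for the weights alone, in GiB."""
    return params * bits / 8 / 2**30


# Hypothetical single-GPU memory sizes in GiB, for illustration only.
GPU_SIZES_GIB = (24, 48, 80)


def smallest_single_gpu(params, bits, sizes=GPU_SIZES_GIB, headroom=0.8):
    """Smallest GPU size whose usable memory fits the weights, or None.

    `headroom` is an assumed fraction of GPU memory available for weights;
    the rest is left for the KV cache and activations.
    """
    need = weight_gib(params, bits)
    for size in sorted(sizes):
        if need <= size * headroom:
            return size
    return None


if __name__ == "__main__":
    for params in (13e9, 70e9):
        for bits in (16, 4, 3):
            gpu = smallest_single_gpu(params, bits)
            print(f"{params / 1e9:.0f}B @ {bits}-bit: "
                  f"{weight_gib(params, bits):.1f} GiB -> {gpu}")
```

With these assumptions, a 13B model drops from needing a 48 GiB card in bf16 to fitting on a 24 GiB card at 4-bit, and a 70B model goes from not fitting on any single card to fitting on a 48 GiB one.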