Hacker News
by amanrs 1059 days ago
The key misconception about many quantization methods is that lower precision = better speed.

From skimming the AWQ paper, I believe GPTQ is not much faster than bf16: https://arxiv.org/pdf/2306.00978.pdf

It's 3x faster at a batch size of 1, but that's still over 10x more expensive than GPT-3.5.

For larger batch sizes, bf16 costs dip below 3-bit quantized.
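To make the batch-size point concrete, here is a rough roofline-style sketch, not anything taken from the paper. It assumes a decode step costs whichever is slower: reading all the weights from memory, or doing the matmul FLOPs. It also assumes quantized weights add a fixed compute multiplier for dequantization. The model size, bandwidth, FLOP rate, and that multiplier are all made-up illustrative values, and the KV cache and activations are ignored.

```python
# Toy roofline estimate of decode throughput. Every number here is an
# illustrative assumption, not a measurement from the AWQ paper or any GPU.

N_PARAMS = 7e9        # assumed model size (parameters)
MEM_BW = 1e12         # assumed memory bandwidth, bytes/s
PEAK_FLOPS = 1e14     # assumed usable compute, FLOP/s
DEQUANT_COST = 2.0    # assumed compute multiplier for dequantizing weights


def step_time_s(batch, bits):
    """Estimated seconds for one decode step (one new token per sequence)."""
    t_mem = (N_PARAMS * bits / 8) / MEM_BW          # read every weight once
    t_compute = 2 * N_PARAMS * batch / PEAK_FLOPS   # ~2 FLOPs per weight per token
    if bits < 16:
        t_compute *= DEQUANT_COST                   # hypothetical dequant overhead
    return max(t_mem, t_compute)                    # the slower side dominates


def tokens_per_s(batch, bits):
    """Aggregate tokens per second across the whole batch."""
    return batch / step_time_s(batch, bits)


for batch in (1, 8, 64, 256):
    bf16 = tokens_per_s(batch, 16)
    q3 = tokens_per_s(batch, 3)
    print(f"batch={batch:4d}  bf16={bf16:10.0f} tok/s  3-bit={q3:10.0f} tok/s")
```

With these assumed numbers the 3-bit model wins at small batches, where the step is memory-bound, and bf16 pulls ahead at large batches, where the step is compute-bound. Where the crossover lands depends entirely on the constants you plug in.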

1 comment

ExLlama supports batching, and I believe it claws back much of the throughput loss from quantization (depending on the exact settings you use to quantize).

And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.