Hacker News
by brucethemoose2 1058 days ago
exLlama supports batching, and I believe it claws back much of the throughput lost to quantization (how much depends on the exact settings you quantize with).

And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
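
A rough back-of-envelope for the "smaller single GPUs" point, as a sketch only. It counts the weights alone and ignores the KV cache, activations, and quantization overhead such as scales and zero-points. The 13B parameter count and the function name are illustrative assumptions, not anything from the thread:

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


if __name__ == "__main__":
    n_params = 13e9  # assumed model size, purely for illustration
    for bits in (16, 8, 4):
        gib = weight_memory_gib(n_params, bits)
        print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what lets a model that would otherwise need multiple cards fit on one.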