by brucethemoose2 1062 days ago

> We serve Llama on 2 80-GB A100 GPUs, as that is the minimum required to fit Llama in memory (with 16-bit precision)

Well there is your problem.

LLaMA quantized to 4 bits fits in 40GB, and it gets similar throughput when split across two consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB Pro GPU).

https://github.com/turboderp/exllama#dual-gpu-results
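
For a rough sanity check of the memory claim, here is a back-of-the-envelope sketch. It assumes the 70B model (the article doesn't say which size was tested, but "2 80-GB A100s at 16-bit" is consistent with it) and counts weights only. KV cache, activations, and the per-group scales that real 4-bit formats store are ignored, so actual usage will be somewhat higher. The function name is just for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # assumption: Llama 70B

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit weights, ignoring scales/zeros

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

Under those assumptions the 16-bit weights alone exceed a single 80GB card, while the 4-bit weights leave a few GB of headroom on a 40GB one.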

And this is without any consideration of batching (which I am not familiar with TBH).

Also, I'm not sure which model was tested, but Llama 70B chat should have better performance than the base model if the prompting syntax is right. That was only reverse engineered from the Meta demo implementation recently.

There are other "perks" from llama too, like manually adjusting various generation parameters, custom grammar during generation and extended context.


amanrs 1062 days ago

The key misconception about many quantization methods is that lower precision = better speed.

From skimming the AWQ paper, I believe GPTQ is not much faster than bf16: https://arxiv.org/pdf/2306.00978.pdf

It's 3x faster for a batch size of 1, but that's still over 10x more expensive than GPT-3.5.

For larger batch sizes, bf16 costs dip below 3-bit quantized.
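
One way to see why the speedup depends on batch size is a toy roofline model. This is only a sketch, not anything from the AWQ paper: it assumes each decoding step costs the larger of "time to read the weights once" and "time to do about 2 FLOPs per parameter per sequence", that quantized kernels dequantize and compute at the same 16-bit peak rate, and it ignores dequantization overhead and KV cache traffic. The hardware numbers are round figures roughly matching an 80GB A100 (about 2 TB/s, 312 TFLOPS dense bf16), and the 70B parameter count is an assumption too.

```python
def decode_step_time_s(n_params, bits_per_weight, batch_size,
                       mem_bw_bytes_s, peak_flops):
    """Toy roofline estimate of one decoding step, in seconds."""
    weight_bytes = n_params * bits_per_weight / 8
    memory_time = weight_bytes / mem_bw_bytes_s               # read weights once
    compute_time = 2 * n_params * batch_size / peak_flops     # ~2 FLOPs/param/sequence
    return max(memory_time, compute_time)


def quant_speedup(batch_size, n_params=70e9, mem_bw=2.0e12, peak=312e12):
    """Ratio of 16-bit to 4-bit step time in the toy model (all defaults assumed)."""
    t16 = decode_step_time_s(n_params, 16, batch_size, mem_bw, peak)
    t4 = decode_step_time_s(n_params, 4, batch_size, mem_bw, peak)
    return t16 / t4


for b in (1, 64, 512):
    print(b, round(quant_speedup(b), 2))
```

In this model the 4-bit advantage is largest at batch size 1, where reading weights dominates, and shrinks toward 1x once both precisions are compute-bound. Any real dequantization overhead, which the sketch leaves out, would push the quantized side below 1x at that point.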


brucethemoose2 1062 days ago

exLlama supports batching, and I believe it claws back much of the throughput loss from quantization (depending on the exact settings you use to quantize).

And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
