Really great write-up, thank you John. Two naive questions. First, with the 4060 Ti, are those the 16GB models? (I'm idly comparing pricing in Australia, as I've started toying with LM Studio and lack of VRAM is, as you say, awful.) Semi-related, the actual quantisation choice you made wasn't specified. I'm guessing 4 or 5 bit? At which point my question is which ones you experimented with, after setting up your prompts / JSON handling, and whether you found much difference in accuracy between them? (I've been using Mistral 7B at q5, but running from RAM requires some patience.) I'd expect a lower quantisation to still be pretty accurate for this use case, with a promise of much faster response times, given you are VRAM-constrained, yeah?
I use 4-bit GPTQ quants. I use tensor parallelism (vLLM supports it natively) to split the model across two GPUs, leaving me with exactly zero free VRAM. There are several reasons behind this decision (some of which are explained in the blog):
- TheBloke's GPTQ quants only support 4-bit and 3-bit. I did not test the difference myself, but the quality gap between 3-bit and 4-bit tends to be large and I wanted high accuracy for non-assistant tasks too, so I simply went with 4-bit.
- vLLM only supports GPTQ, AWQ, and SqueezeLLM for quantization. I needed vLLM to serve multiple clients at a time, and it's very fast (I want to use the same engine for multiple tasks; this smart assistant is only one use case). I get about 17 tokens/second, which isn't great, but very functional for my needs.
- I chose GPTQ over AWQ for reasons I discussed in the post, and I don't know anything about SqueezeLLM.
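
For anyone wanting to try a similar setup, here is a rough sketch of how a 4-bit GPTQ model can be loaded in vLLM with tensor parallelism across two GPUs. It is an illustration under assumptions, not the exact configuration from the blog: the model id is a placeholder, the `gpu_memory_utilization` value and the prompt are made up, and the helper names `engine_kwargs` and `main` are hypothetical.

```python
# Sketch only: MODEL_ID is a placeholder, not necessarily the model from the post.
MODEL_ID = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"  # assumption for illustration


def engine_kwargs(num_gpus: int = 2) -> dict:
    """Build keyword arguments for vllm.LLM (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "quantization": "gptq",            # load GPTQ-quantized weights
        "tensor_parallel_size": num_gpus,  # split the model across this many GPUs
        "gpu_memory_utilization": 0.95,    # assumed value; tune for your cards
    }


def main() -> None:
    # Imported inside the function so the helper above can be used
    # (and tested) on a machine without vLLM or a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(temperature=0.0, max_tokens=64)
    # Made-up prompt, loosely in the spirit of a smart-assistant request.
    outputs = llm.generate(["Turn on the kitchen lights."], params)
    print(outputs[0].outputs[0].text)
```

Calling `main()` on a machine with two GPUs and vLLM installed should load the model and print one completion. Raising `num_gpus` only makes sense if that many GPUs are actually visible to the process.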