by KVFinn
1200 days ago
Very cool. I've seen some people running 4-bit 65B on dual 3090s, but haven't noticed a benchmark yet to compare. It looks like this is regular 4-bit and not GPTQ 4-bit? It's possible there's quality loss but we'll have to test.

>4-bit quantization tends to come at a cost of substantial output quality losses. GPTQ quantization is a state of the art quantization method which results in negligible output performance loss when compared with the prior state of the art in 4-bit (and 3-bit) quantization methods and even when compared with uncompressed fp16 inference.

https://github.com/ggerganov/llama.cpp/issues/9
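
For anyone unsure what "regular" 4-bit might mean here, a rough sketch follows. It assumes plain round-to-nearest quantization with one scale per block of weights. This is my assumption for illustration, not the exact format llama.cpp uses, and the function names and `block_size` are made up. GPTQ differs in that it uses approximate second-order information to compensate for the rounding error instead of rounding each weight independently.

```python
def quantize_rtn_4bit(weights, block_size=32):
    """Round-to-nearest 4-bit quantization, one scale per block (illustrative only)."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Signed 4-bit codes live in [-8, 7].
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize(blocks):
    """Rebuild approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


def max_abs_error(weights, block_size=32):
    """Largest absolute difference between original and dequantized weights."""
    restored = dequantize(quantize_rtn_4bit(weights, block_size))
    return max(abs(a - b) for a, b in zip(weights, restored))
```

Under this scheme each weight is off by at most half a scale step, so a single outlier in a block inflates the error for every other weight in that block. That is one plausible reason naive 4-bit could lose quality, but as said above, it needs testing.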