Hacker News
by mrWiz 1196 days ago
The 4 bit quantization performs well, though. Does your 1 bit version?

1 bit will mathematically be guaranteed to be more efficient in performance-per-parameter terms, so to me it's a pretty clear eventuality one day, but I think the relative performance will likely still tank. Honestly, I'm impressed that it held up so well at 4 bit; I personally thought 8 bit was the limit.

However, I can see fractional bits (via binary representations) and larger models happening first, before that compression step.

And then we have the sub-bit range..... ;DDDD

Do you have the numbers? I suspect it is way worse. The original llama.cpp authors never measured any numbers either.
The Python implementation[1] ran some tests using the same quantization algorithm as llama.cpp (4 bit RTN, i.e. round-to-nearest).

1: https://github.com/qwopqwop200/GPTQ-for-LLaMa

Great, thanks a lot.

So we have numbers on PTB: original perplexity 8.79, quantized 9.68, already 10% worse. And the PPL is reported per token, I suppose? Because word-level PPL for PTB should be around 20, not less than 10.
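
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes both perplexities are computed from the same total negative log-likelihood, in which case the word-level value is the token-level value raised to the tokens-per-word ratio. The function names are made up for illustration, and the word-level figure of 20 is just the estimate from the comment above, not a measured result.

```python
import math

def relative_increase(ppl_base, ppl_quant):
    """Fractional increase in perplexity after quantization."""
    return ppl_quant / ppl_base - 1.0

def word_level_ppl(token_ppl, tokens_per_word):
    """Convert token-level to word-level perplexity.

    PPL_word = exp(total_nll / n_words)
             = PPL_token ** (n_tokens / n_words)
    """
    return token_ppl ** tokens_per_word

def implied_tokens_per_word(token_ppl, word_ppl):
    """Tokens-per-word ratio that would reconcile the two perplexities."""
    return math.log(word_ppl) / math.log(token_ppl)

# Numbers quoted in the thread: 8.79 (original) vs 9.68 (4 bit).
increase = relative_increase(8.79, 9.68)

# Ratio needed for a token-level 8.79 to match a word-level PPL of 20
# (the 20 is the commenter's estimate, treated here as an assumption).
ratio = implied_tokens_per_word(8.79, 20.0)

print(f"relative increase: {increase:.1%}")
print(f"implied tokens per word: {ratio:.2f}")
```

So the "10% worse" figure holds, and a per-token PPL of 8.79 would line up with a word-level PPL near 20 if the tokenizer produced roughly 1.4 tokens per word on PTB, which seems plausible for a subword tokenizer.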

Any numbers on more complex tasks, then? Like QA?

They're using GPTQ; here you go: https://arxiv.org/abs/2210.17323 . The authors benchmarked two families of models over a wide range of parameter counts.
llama.cpp is using RTN at the moment.
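
To make the RTN part concrete, here is a minimal sketch of symmetric round-to-nearest quantization for one group of weights. This is a generic illustration, not llama.cpp's actual storage format or code; the function names and the single per-group scale are assumptions made for the example.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [-2**(bits-1), 2**(bits-1) - 1] and the
    floating-point scale needed to reconstruct approximate weights.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4 bits
    amax = max(abs(w) for w in weights)   # largest magnitude in the group
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def rtn_dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
codes, scale = rtn_quantize(weights, bits=4)
restored = rtn_dequantize(codes, scale)

# Each weight is rounded independently, so the error per weight is at
# most half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_err)
```

The point of contrast with GPTQ is that RTN rounds every weight on its own and never looks at how the rounding error affects the layer's output, which is why it is simple to implement but tends to lose more accuracy at low bit widths.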