| HN Mirror



	by nshm 1196 days ago
	It is not really llama, it is llama quantized to 4bit. Not even the quality of original 7B. I could also quantize it to 1 bit and claim it runs on my RPI3.

4 comments

elbigbad 1196 days ago

The quantization to four bits doesn’t have that much effect on the output. 1 bit might not either, but someone would need to do some testing before making the claim that “1 bit … runs on my RPI3” because “runs” is a bit overloaded to mean “runs and produces sensible output.” I think you’re missing that “runs” here has that overloading.


nwoli 1196 days ago

It should also be mentioned that it isn’t really that each weight is a 4 bit float, but rather that they’re basically clustering floats into 2^4 clusters and then grabbing from a lookup table the float associated with a 4 bit value as needed. So as long as the weights roughly fall into 16 clusters you’ll get identical results.
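A minimal sketch of the lookup-table scheme described in this comment, assuming a plain 1-D k-means over the weights. The function names and the quantile initialisation are illustrative choices, not taken from llama.cpp (which, per a later comment in this thread, was using RTN at the time).

```python
import numpy as np

def build_codebook(weights, bits=4, iters=20):
    """Cluster weights into 2**bits centroids with 1-D k-means (Lloyd's algorithm)."""
    k = 2 ** bits
    flat = weights.ravel().astype(np.float32)
    # Assumption: start the centroids at evenly spaced quantiles of the weights.
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, k)).astype(np.float32)
    for _ in range(iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = flat[idx == c]
            if members.size:
                codebook[c] = members.mean()
    # Final assignment against the updated centroids.
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx.astype(np.uint8).reshape(weights.shape)

def dequantize(codebook, idx):
    """Look up the float stored for each 4-bit index."""
    return codebook[idx]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)  # stand-in for a weight matrix
codebook, idx = build_codebook(w)
w_hat = dequantize(codebook, idx)
print("mean absolute error:", float(np.abs(w - w_hat).mean()))
```

Only the 16-entry table and the 4-bit indices need to be stored; how close `w_hat` gets to `w` depends on how tightly the real weights cluster.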


alden5 1196 days ago

I haven't noticed 4-bit quantization affecting the quality of LLaMA-7B. It produces very coherent outputs; the trick is having a good example in your prompt so it has a good idea of what's expected of it.


muttled 1195 days ago

Quality and quantity: I've had the best luck cramming a bunch of examples into the input, just like with GPT-J where you're only working with 6B parameters. Make sure the format stays consistent and is ideally presented in the shape you'd encounter that same text in if you found it on a webpage somewhere.


mrWiz 1196 days ago

The 4 bit quantization performs well, though. Does your 1 bit version?


tbalsam 1196 days ago

1 bit is mathematically guaranteed to be more efficient in performance per parameter, so to me it is a pretty clear eventuality one day, but I think the relative performance % will likely still tank. Honestly impressed that it held up so well at 4 bit; I personally thought 8 bit was the ceiling.

However I can see fractional bits (via binary representations) and larger models happening first before that compression step.

And then we have the sub-bit range..... ;DDDD


nshm 1196 days ago

Do you have the numbers? I suspect it is way worse. The original llama.cpp authors never measured any numbers either.


ddren 1196 days ago

The python implementation[1] ran some tests using the same quantization algorithm as llama.cpp (4 bit RTN).

1: https://github.com/qwopqwop200/GPTQ-for-LLaMa
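For readers unfamiliar with RTN (round-to-nearest), here is a rough sketch of the idea: scale each block of weights so it fits the 4-bit integer range, then round. The block size of 32, the symmetric scaling, and the function names are assumptions for illustration; this is not the exact storage format of llama.cpp or GPTQ-for-LLaMa.

```python
import numpy as np

def rtn_quantize(weights, bits=4, block=32):
    """Round-to-nearest quantization with one float scale per block (assumed layout)."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4 bits
    blocks = weights.reshape(-1, block).astype(np.float32)
    # Scale so the largest magnitude in each block maps to qmax.
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # avoid dividing by zero on all-zero blocks
    q = np.clip(np.rint(blocks / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q, scale, shape):
    """Recover approximate floats from the integers and per-block scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))                         # stand-in for a weight matrix
q, scale = rtn_quantize(w)
w_hat = rtn_dequantize(q, scale, w.shape)
print("max absolute error:", float(np.abs(w - w_hat).max()))
```

No calibration data is involved: the rounding error per weight is bounded by half a quantization step, which is what separates RTN from methods such as GPTQ that adjust the rounding to reduce layer output error.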


nshm 1196 days ago

Great, thanks a lot.

So we have numbers on PTB: original perplexity 8.79, quantized 9.68, already 10% worse. And PPL is reported per token, I suppose? Because word-level PPL for PTB must be around 20, not less than 10.

Any numbers on more complex tasks then, like QA?
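The arithmetic behind this paragraph can be checked with a short sketch. It relies on the standard identity that, for the same total log-likelihood, word-level perplexity equals token-level perplexity raised to the power of tokens per word. The word-level figure of 20 is the commenter's rough estimate, and the tokens-per-word ratio is derived here purely for illustration; neither is a measured value.

```python
import math

def relative_increase(baseline, quantized):
    """Fractional increase in perplexity after quantization."""
    return (quantized - baseline) / baseline

def word_level_ppl(token_ppl, tokens_per_word):
    """Convert per-token perplexity to per-word perplexity."""
    return token_ppl ** tokens_per_word

def implied_tokens_per_word(token_ppl, word_ppl):
    """Tokens-per-word ratio that would reconcile the two perplexities."""
    return math.log(word_ppl) / math.log(token_ppl)

increase = relative_increase(8.79, 9.68)
ratio = implied_tokens_per_word(8.79, 20.0)  # 20.0 is the commenter's rough figure
print(f"relative increase: {increase:.1%}")
print(f"tokens per word implied by a word PPL of 20: {ratio:.2f}")
```

If the tokenizer splits words into more than one token on average, a per-token perplexity below 10 and a per-word perplexity near 20 are not in conflict.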


summarity 1196 days ago

Some numbers here: https://github.com/qwopqwop200/GPTQ-for-LLaMa#result


sottol 1196 days ago

They're using GPTQ -- here you go: https://arxiv.org/abs/2210.17323 . The authors benchmarked two families of models over a wide range of parameter counts.


ddren 1196 days ago

llama.cpp is using RTN at the moment.


renewiltord 1196 days ago

I used the 7B quantized to 4 bit and it needs a few tries for most things, but it's not useless.
