| HN Mirror



	by kherud 1136 days ago
	Can somebody please explain how quantization below 8 bits works? Since a byte is, I think, the smallest addressable unit, is the dimensionality of the weights somehow reduced?

5 comments

waleedk 1136 days ago

[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.

More detail than you probably wanted: https://huggingface.co/blog/hf-bitsandbytes-integration
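To make "approximate the weights using fewer bits" concrete, here is a minimal sketch of symmetric absmax quantization to 4-bit integers. It is an illustration under my own assumptions, not what bitsandbytes actually does: the function names are hypothetical, and real schemes typically use per-block scales and handle outliers separately.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [round(w / scale) for w in weights]    # integer codes
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_absmax(weights)
approx = dequantize(q, scale)
```

Each recovered weight is within half a quantization step (`scale / 2`) of the original, which is the approximation error you accept in exchange for the smaller storage.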


MacsHeadroom 1136 days ago

The latest release of bitsandbytes uses a new fp4 format. 4-bit floating point scaling results in much lower perplexity than int4.

Also note that for a fixed memory (RAM) size, a larger model at 4 bits (even int4) is always superior to a smaller model at 8 bits, resulting in lower perplexity.

E.g. LLaMA-13B int4 is far better/lower perplexity than LLaMA-7B fp8 while using the same amount of RAM.
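The "same amount of RAM" part is just arithmetic on the weight storage. A rough sketch, assuming nominal parameter counts of 13e9 and 7e9 and ignoring quantization scales, activations, and the KV cache (`weight_bytes` is a name made up for this example):

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given bit width."""
    return n_params * bits_per_weight / 8


GB = 1e9
size_13b_4bit = weight_bytes(13e9, 4) / GB   # about 6.5 GB
size_7b_8bit = weight_bytes(7e9, 8) / GB     # about 7.0 GB
```

So the 13B model at 4 bits needs roughly the same weight memory as the 7B model at 8 bits, slightly less under these assumptions.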


dahart 1136 days ago

Software can address units of any size by packing and unpacking bits from bytes (or, more likely, words) in the underlying implementation. I don't know about any specific NN implementation here; I'm just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.
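As a generic illustration of that packing and unpacking (not taken from any particular NN library), here is a small sketch that stores two unsigned 4-bit values per byte. The low-nibble-first order and the function names are my own choices:

```python
def pack_nibbles(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)                       # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit values from the packed bytes."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)             # low nibble
        values.append(byte >> 4)               # high nibble
    return values[:count]


packed = pack_nibbles([1, 2, 3, 15])
restored = unpack_nibbles(packed, 4)
```

Four values end up in two bytes, so the memory system still only ever sees whole bytes while the software works with 4-bit units.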


sifar 1136 days ago

Generally, since the memory is byte addressable, you load data that is packed into bytes. It is the compute instructions that use only the bits they need.

So in this case one would load a byte that holds two 4-bit values, and then a 4-bit ADD or MAC would operate on them.

If you don't have such instructions, you need to sign- or zero-extend (or otherwise convert) the smaller bit widths to 8, 16, or 32 bits, whichever is available.
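A software sketch of that fallback path, assuming two 4-bit fields per byte with the low nibble first (the layout and helper names are hypothetical). Each field is sign- or zero-extended to an ordinary integer before the multiply-accumulate:

```python
def zero_extend4(nibble):
    """Treat the 4 bits as unsigned: 0..15."""
    return nibble & 0x0F


def sign_extend4(nibble):
    """Treat the 4 bits as two's complement: -8..7."""
    nibble &= 0x0F
    return nibble - 16 if nibble & 0x08 else nibble


def split_byte(byte, signed=True):
    """Return (low, high) 4-bit fields of a byte, widened to plain ints."""
    extend = sign_extend4 if signed else zero_extend4
    return extend(byte & 0x0F), extend(byte >> 4)


def dot_packed(packed_a, packed_b):
    """MAC over two packed int4 vectors, accumulating in a wide integer."""
    acc = 0
    for a, b in zip(packed_a, packed_b):
        a_lo, a_hi = split_byte(a)
        b_lo, b_hi = split_byte(b)
        acc += a_lo * b_lo + a_hi * b_hi
    return acc
```

For example, the byte `0xF3` splits into `(3, -1)` when sign-extended and `(3, 15)` when zero-extended, which is why the choice of extension has to match how the values were encoded.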


Ambix 1135 days ago

Go see for yourself :)

https://github.com/ggerganov/llama.cpp/blob/master/examples/...

There are too many schemes right now, with 4_0 and 5_1 really popular among LLM geeks.


f_devd 1136 days ago

I believe it's locally (in the inner loop or SIMD op) up-cast to float8/float16/int8, but I haven't looked at the internals of llama.cpp myself.
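To show what such a local up-cast could look like, here is a sketch of a dot product over one block of packed 4-bit weights. This is not checked against llama.cpp; the layout (two unsigned codes per byte, low nibble first, code `c` meaning `(c - 8) * scale`, one scale per block) and the name `dot_q4_block` are invented for illustration:

```python
def dot_q4_block(packed, scale, activations):
    """Dot product of one packed 4-bit weight block with float activations.

    Hypothetical layout: each byte holds two unsigned 4-bit codes (low
    nibble first); a code c stands for the weight (c - 8) * scale.
    """
    acc = 0.0
    for i, byte in enumerate(packed):
        # Up-cast each 4-bit code to float only here, inside the loop.
        w_lo = float((byte & 0x0F) - 8) * scale
        w_hi = float((byte >> 4) - 8) * scale
        acc += w_lo * activations[2 * i] + w_hi * activations[2 * i + 1]
    return acc


block = bytes([0x9A, 0x08])       # codes 10, 9, 8, 0 -> weights 1.0, 0.5, 0.0, -4.0
result = dot_q4_block(block, 0.5, [1.0, 2.0, 3.0, 0.25])
```

The weights stay packed in memory at 4 bits each; only the values currently being multiplied exist at the wider type.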
