Can somebody please explain how quantization below 8 bits works? Since a byte is the smallest addressable unit (I think), is the dimensionality of the weights somehow reduced?
[Author] You approximate the weights using fewer bits. You also switch to ints instead of floats and then do some fancy stuff when multiplying to make it all work together.
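To make that concrete, here is a minimal sketch of one common scheme (symmetric "absmax" quantization). It is a generic illustration, not necessarily what any particular library does. The names `quantize`, `dequantize` and `quantized_dot` are made up for this example. Note that the number of weights stays the same; only the precision of each one drops.

```python
def quantize(weights, bits=4):
    """Map floats to signed ints in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax else 1.0      # avoid division by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


def quantized_dot(qa, scale_a, qb, scale_b):
    """Multiply-accumulate in integers, then rescale once at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))      # pure integer arithmetic
    return acc * scale_a * scale_b


weights = [3.5, -2.0, 0.6, 0.0]
q, scale = quantize(weights)
print(q, scale)                # integer codes and the shared scale
print(dequantize(q, scale))    # close to the original weights, not identical
```

The rescale at the end of `quantized_dot` is a simple version of the "fancy stuff": the expensive inner loop runs on small integers and the float scales are applied once per output.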
Software can address units of any size by packing and unpacking bits from bytes (or more likely words) in the underlying implementation. I don’t know about any specific NN implementation here; I'm just commenting in general that the size of the addressable unit and the size of your reads and writes can be completely independent. I routinely use bit-packing data compression techniques in CUDA, for example.
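A rough sketch of what that packing looks like for 4-bit values, in plain Python rather than CUDA to keep it short. The function names and the low-nibble-first layout are arbitrary choices for this example, not taken from any specific library.

```python
def pack_nibbles(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)                  # pad to an even count
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo <= 15 and 0 <= hi <= 15):
            raise ValueError("values must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_nibbles(packed, count):
    """Reverse pack_nibbles; count drops any padding nibble."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)        # low nibble
        values.append(byte >> 4)          # high nibble
    return values[:count]


packed = pack_nibbles([1, 2, 3])
print(packed.hex())                       # two bytes hold three 4-bit values
print(unpack_nibbles(packed, 3))
```

The memory system only ever sees whole bytes; the 4-bit "addressing" exists purely in the shifts and masks.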
Generally, since memory is byte-addressable, you load data that is packed into bytes. It is the compute instructions that operate on just the bits they need.
So in this case you would load a byte that holds two 4-bit values, and then a 4-bit ADD or MAC instruction would operate on them.
If the hardware doesn't have such instructions, you need to sign- or zero-extend (or otherwise convert) the narrower values to 8, 16, or 32 bits, whichever is available.
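Here is a small sketch of the fallback path, assuming signed two's-complement 4-bit weights stored low nibble first. Both are assumptions for illustration, and the helper names are invented. Python ints stand in for the wider 8/16/32-bit registers.

```python
def sign_extend_4bit(nibble):
    """Interpret a 4-bit pattern (0..15) as two's complement, giving -8..7."""
    return (nibble ^ 0x8) - 0x8


def unpack_signed_pair(byte):
    """Split one byte into its two signed 4-bit values (low, high)."""
    lo = sign_extend_4bit(byte & 0x0F)
    hi = sign_extend_4bit((byte >> 4) & 0x0F)
    return lo, hi


def mac_packed(weights_packed, activations):
    """Multiply-accumulate packed 4-bit weights against integer activations."""
    acc = 0                                # wide accumulator
    for i, byte in enumerate(weights_packed):
        lo, hi = unpack_signed_pair(byte)  # widen before doing arithmetic
        acc += lo * activations[2 * i] + hi * activations[2 * i + 1]
    return acc


# 0xF7 holds (7, -1) and 0x82 holds (2, -8)
print(unpack_signed_pair(0xF7))
print(mac_packed(bytes([0xF7, 0x82]), [1, 2, 3, 4]))
```

The XOR-and-subtract trick in `sign_extend_4bit` flips the sign bit and then removes its weight, which is a common branch-free way to sign-extend a narrow field.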
More detail than you probably wanted: https://huggingface.co/blog/hf-bitsandbytes-integration