| Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below. > We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8. Interesting, what is "group-size of 8"? From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit). So for every 8 binary weights we have a 16-bit scale and an 8-bit shift? > Fine-tuning with Low-Rank Adapters They say they inline the shift into the LoRA, but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)? Then, the reported 7B sizes, in GB: > 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit) Those numbers would make sense if it was _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into the LoRA? still unexplained) it'd be more like 3-bit. From their HF page: > This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory. Interesting, so we have to go back to the CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I'm also amazed they got latency lower than quip# if they ping-pong to the CPU. |
All linear-quantization methods have meta-data, including the 1.58-bit paper. You can control the quality vs. memory usage trade-off through the group-size: a smaller group-size gives better quality at the cost of more meta-data. However, the meta-data is not the same thing as the quantized weights, for several reasons:
> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.
> The meta-data size doesn't increase the actual compute: these are point-wise operations, and even with a single scalar you still need to multiply the same number of weights.
> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy during the previous layer and then synchronize, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, but then I tried a better machine and the transfer was much faster. Obviously if you are trying it on Google Colab it's very slow for this reason.
> Smaller models like Llama2-7B are very hard to quantize directly at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral) can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.
> There are many other ways to increase the group-size or even get rid of it altogether. Many ideas are available, but they need lots of experimentation.
> Binary/ternary CUDA matmul kernels don't exist yet. The current code implements the dequantization step in CUDA but then uses torch.matmul in fp16. I tried doing matmul at low bits with CUDA, but it is very difficult to even beat cuBLAS with fp16, especially for a novice CUDA coder like me :)
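To make the meta-data points above a bit more concrete, here is a minimal NumPy sketch. It is not the actual HQQ code. It assumes the usual affine form `W ≈ (W_q - zero) * scale` with one scale and one zero-point per group, and it borrows the 16-bit scale / 8-bit zero-point sizes from the parent comment's guess. The names `bits_per_weight` and `dequantize` are made up for illustration.

```python
import numpy as np

def bits_per_weight(nbits, group_size, scale_bits=16, zero_bits=8):
    """Effective storage per weight when every group carries one scale and one zero-point.
    The default meta-data sizes are assumptions, not measured values."""
    return nbits + (scale_bits + zero_bits) / group_size

def dequantize(w_q, scale, zero, group_size):
    """Point-wise step: W ~= (W_q - zero) * scale, applied group by group along the input dim."""
    out_features, in_features = w_q.shape
    groups = w_q.reshape(out_features, in_features // group_size, group_size)
    w = (groups.astype(np.float32) - zero[..., None]) * scale[..., None]
    return w.reshape(out_features, in_features)

rng = np.random.default_rng(0)
out_features, in_features, group_size = 4, 16, 8
n_groups = in_features // group_size

# 1-bit codes plus per-group meta-data (random values, purely illustrative)
w_q = rng.integers(0, 2, size=(out_features, in_features), dtype=np.uint8)
scale = rng.uniform(0.5, 1.5, size=(out_features, n_groups)).astype(np.float32)
zero = rng.uniform(0.0, 1.0, size=(out_features, n_groups)).astype(np.float32)

x = rng.standard_normal((2, in_features)).astype(np.float32)
w = dequantize(w_q, scale, zero, group_size)  # same number of multiplies for any group size
y = x @ w.T                                   # ordinary dense matmul on dequantized weights

print(bits_per_weight(1, 8))               # 4.0
print(bits_per_weight(1, 8, zero_bits=0))  # 3.0
print(bits_per_weight(1, 64))              # 1.375
print(y.shape)                             # (2, 4)
```

Under these assumptions, a group-size of 8 costs about 3 to 4 bits per weight in total storage, while a group-size of 64 would cost about 1.4. The dequantization itself touches every weight once regardless of how many groups there are, and the matmul that follows is a plain dense one, which mirrors the dequantize-then-`torch.matmul` path described above.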
Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community first as we progress. There are still a lot of things to be done and we are actively working on them, despite the very limited resources we have.
Happy to answer any questions here!