by Palmik 957 days ago
This is amazing, and will unlock many possibilities. I just recently read the S-LoRA paper, which is related, but it's even better to have a working (and extremely efficient!) implementation.

How hard would it be to adapt your kernels to work with the new-gen quants like AWQ or EXL2?


Thanks for your encouragement! We are working on quantization as well. We recently submitted a paper, Atom [1], that uses 4-bit quantization, delivering up to 7.73x throughput compared to FP16 and 2.53x compared to INT8. Atom is able to maintain a perplexity (a proxy for model accuracy; lower is better) close to FP16, outperforming existing quantization approaches.

We are polishing the 4-bit code. It will be added to the Punica code base soon. Please stay tuned :)

[1] https://arxiv.org/abs/2310.19102

Added to my reading list! The world of quantizations is moving so fast even TheBloke might not be able to keep up!

So Atom base models would be compatible with Punica?

I also wonder: many people already train LoRAs with the base model in 8-bit or even 4-bit, so would it make sense to match the quantization algo used during training and inference?
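
For intuition on the question above, here is a rough NumPy sketch of what a mismatch could look like. Everything in it is an assumption for illustration: the two toy round-to-nearest schemes (per-tensor and per-group) are not AWQ, EXL2, or Atom, and the random matrices `w`, `a`, and `b` are hypothetical stand-ins for a base weight and a rank-8 LoRA pair. It only shows that the same adapter sits on top of a different effective base when the quantizer changes between training and serving.

```python
import numpy as np

def quantize_per_tensor(w, bits=4):
    """Symmetric round-to-nearest with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def quantize_per_group(w, bits=4, group_size=32):
    """Symmetric round-to-nearest with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax, qmax) * scale
    return q.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))         # stand-in for a base-model weight matrix
a = rng.normal(size=(64, 8)) * 0.01   # hypothetical LoRA factors (rank 8)
b = rng.normal(size=(8, 64)) * 0.01
x = rng.normal(size=(4, 64))          # a small batch of activations

w_train = quantize_per_tensor(w)  # base as the adapter saw it during training
w_serve = quantize_per_group(w)   # base as served under a different scheme

y_matched = x @ (w_train + a @ b)
y_mismatched = x @ (w_serve + a @ b)
rel_gap = np.linalg.norm(y_matched - y_mismatched) / np.linalg.norm(y_matched)
print(f"relative output gap from mismatched base quantization: {rel_gap:.3f}")
```

Whether a gap like this actually hurts a real adapter would have to be measured on the real model and quantizers; the sketch only makes the question concrete.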