by abcdabcd987
947 days ago
Thanks for your encouragement! We are working on quantization as well. We recently submitted a paper, Atom [1], that uses 4-bit quantization, delivering 7.73x throughput compared to FP16 and 2.53x compared to INT8. Atom is able to maintain a perplexity (a proxy for model accuracy) close to FP16, outperforming existing quantization approaches. We are polishing the 4-bit code. It will be added to the Punica code base soon. Please stay tuned :)

[1] https://arxiv.org/abs/2310.19102

So Atom base models would be compatible with Punica?
I also wonder: many people already train LoRAs with the base model quantized to 8 or even 4 bits. Would it make sense to match the quantization algorithm used during training and inference?