by zhisbug 1166 days ago
Lmsys hasn't released any official 4-bit version, so it might be a better idea to wait for one. But it is interesting to learn that the third-party 4-bit version shows performance degradation.

Lmsys hasn't released any official weights for anything. They've released "deltas" and other people have applied those deltas to the appropriate llama weights and done the quantization.
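For anyone unfamiliar with the "delta" approach, here is a minimal sketch of the idea, assuming a delta is simply the element-wise difference between the fine-tuned and base weights. The function name `apply_delta` and the toy tensor names are hypothetical, and this is not the FastChat script itself, which works on full checkpoints.

```python
def apply_delta(base, delta):
    """Rebuild fine-tuned weights as base + delta, tensor by tensor.

    Both arguments map a tensor name to a flat list of floats
    (a toy stand-in for real checkpoint tensors).
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for {name}")
        # Element-wise addition recovers the fine-tuned weight.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy example: one tensor with three weights.
base = {"layer0.weight": [0.5, -1.0, 2.0]}
delta = {"layer0.weight": [0.25, 0.5, -1.0]}
merged = apply_delta(base, delta)
print(merged)
```

Quantization is then a separate step that whoever merged the weights runs on the result.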

I reject your premise that the 8-to-4-bit quantization is the cause of the vicuna fine-tuned llama's very average performance, though. This hasn't been the case for any of the other 8-to-4-bit quantizations, so vicuna would be a unique outlier. And so I don't think this is the "cause" here.
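To make the 8-bit versus 4-bit comparison concrete, here is a rough sketch of how per-weight rounding error grows with fewer bits. It assumes naive symmetric round-to-nearest quantization with one scale for the whole list, which is not the scheme used by llama.cpp or GPTQ, and the weights are made-up numbers. It says nothing by itself about end-to-end model quality.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed `bits`-bit integers and map back to floats."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)               # all-zero input needs no rounding
    return [round(v / scale) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference between original and restored weights."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.1, -0.35, 0.7, -1.0]         # hypothetical weights
err8 = max_error(weights, 8)
err4 = max_error(weights, 4)
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

The 4-bit error is larger per weight, which is expected; the open question in this thread is whether that is enough to explain the behavior people are seeing.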

And I think the problem of vicuna taking over the user's role in the conversation is caused by this bug: https://github.com/lm-sys/FastChat/commit/1bb234265d16bdfd50...

which has been fixed recently.

Lmsys are launching new training jobs after this patch. Please stay tuned.

Nah, I don't use huggingface transformers to run inference with the vicuna model. I use llama.cpp. But I do appreciate the tip.

edit: Oh, I was completely wrong. That's in the training not the inference so it applies to all the weights.

My point is that I am not aware of any official 4-bit quantized version (delta or weights) from lmsys, so it might be too early to conclude that vicuna fine-tuned llamas lose a lot of performance at 4 bits while others are fine.