by superkuh 1166 days ago
Lmsys hasn't released any official weights for anything. They've released "deltas" and other people have applied those deltas to the appropriate llama weights and done the quantization.
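
For anyone unfamiliar with what "applying the deltas" means: as far as I understand it, each released delta tensor is the element-wise difference between the fine-tuned weight and the original llama weight, so recovering the fine-tuned model is one addition per tensor. A toy sketch of that idea, with plain Python lists standing in for tensors. The function name `apply_delta` and the `layer0.*` parameter names are made up for illustration; this is not the actual conversion script.

```python
def apply_delta(base, delta):
    """Return merged weights computed as base + delta, parameter by parameter.

    `base` and `delta` are dicts mapping a parameter name to a flat list of
    floats (a stand-in for real tensors; an assumption made for this sketch).
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must have the same parameter names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        # Element-wise addition recovers the fine-tuned value.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Hypothetical tiny "model" with two parameters.
base = {"layer0.weight": [0.5, -1.0, 2.0], "layer0.bias": [0.0, 0.25]}
delta = {"layer0.weight": [0.25, 0.5, -1.0], "layer0.bias": [0.5, -0.25]}
merged = apply_delta(base, delta)
```

Quantization then happens on the merged weights, which is why the 4-bit files floating around are third-party conversions rather than anything lmsys published.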

I reject your premise that the 8 to 4 bit quantization is the cause of the vicuna fine-tuned llamas' very average performance, though. That hasn't been the case for any of the other 8 to 4 bit quantizations, so vicuna would be a unique outlier. I don't think quantization is the cause here.
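
To make the 8 vs 4 bit comparison concrete, here is a minimal sketch of the rounding error from a simple quantization round trip. It assumes a single absmax scale over the whole list and symmetric signed integers, which is a simplification I picked for illustration and not the block-wise scheme that llama.cpp or any other tool actually uses. The names `quantize_roundtrip` and `max_abs_error` and the sample weights are hypothetical.

```python
def quantize_roundtrip(values, bits):
    """Quantize to signed `bits`-bit integers with one absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    scale = absmax / qmax
    # Round each value to the nearest representable step, then map back to floats.
    return [round(v / scale) * scale for v in values]


def max_abs_error(values, bits):
    """Largest absolute difference between the original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up sample weights, not taken from any real model.
weights = [0.9, -0.42, 0.13, -0.07, 0.55, -0.88]
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
```

The per-weight error is bounded by half a quantization step, so it grows as the bit width shrinks. Whether that error shows up as worse output quality is the empirical question being argued about here; the sketch says nothing about that.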


And I think the problem of vicuna taking on the role of the user is caused by this bug: https://github.com/lm-sys/FastChat/commit/1bb234265d16bdfd50...

which has been fixed recently.

Lmsys is launching new training jobs after this patch, so please stay tuned.

Nah, I don't use huggingface transformers to run inference with the vicuna model. I use llama.cpp. But I do appreciate the tip.

edit: Oh, I was completely wrong. That fix is in the training code, not the inference code, so it applies to all the weights.

My point is that I am not aware of any official 4-bit quantized version (delta or weights) from lmsys, so it might be too early to conclude that vicuna fine-tuned llamas lose a lot of performance at 4 bit while others are fine.