Hacker News
by superkuh 1166 days ago
My one takeaway after playing with both chat and text completion modes is that gpt4all 7B 4bit stays on the chat rails (it doesn't start taking the role of the user or spewing fine-tuning boilerplate) much better than vicuna 7B 4bit. In text completion they're about the same, but I'd still prefer the vanilla llama 7B in that case.

There are a couple of versions of the gpt4all fine-tuned llama 7B, and my favorite is the unfiltered one (gpt4all-lora-unfiltered-quantized.bin). https://github.com/nomic-ai/gpt4all#try-it-yourself

1 comments

Lmsys hasn't released any official 4-bit version. It might be a better idea to wait for the official 4-bit version. But it is interesting to learn that the third-party 4-bit version shows performance degradation.
Lmsys hasn't released any official weights for anything. They've released "deltas" and other people have applied those deltas to the appropriate llama weights and done the quantization.
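
To make the "deltas" workflow concrete, here is a rough sketch of the two steps: add the delta to the base weights, then quantize the result to 4 bits. Everything in it is an illustrative assumption. The function names are made up, tensors are modeled as plain Python lists, and the quantizer is a simple symmetric absmax scheme, not the format llama.cpp or any other tool actually uses.

```python
def apply_delta(base, delta):
    """Add delta tensors to base tensors, name by name (hypothetical helper)."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same tensor names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for tensor {name!r}")
        # Fine-tuned weight = original weight + published delta.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


def quantize_4bit(values):
    """Symmetric absmax quantization to integer codes in [-7, 7].

    Simplified for illustration: one scale per tensor, no block structure.
    """
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 7 or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    base = {"layer0.weight": [1.0, -2.0]}    # stand-in for llama weights
    delta = {"layer0.weight": [0.5, 0.25]}   # stand-in for a released delta
    merged = apply_delta(base, delta)
    codes, scale = quantize_4bit(merged["layer0.weight"])
    print(merged, codes, scale, dequantize(codes, scale))
```

The point of the sketch is only that the merge and the quantization are separate steps done by whoever produces the file, so a third-party 4-bit file reflects both the delta and that person's quantization choices.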

I reject your premise that the 8 to 4 bit quantization is the cause of the vicuna fine-tuned llama's very average performance, though. This hasn't been the case for any of the other 8 to 4 bit quantizations; it would be a unique outlier. So I don't think this is the "cause" here.

And I think the problem of taking the roles of users in vicuna is caused by this bug: https://github.com/lm-sys/FastChat/commit/1bb234265d16bdfd50...

which has been fixed recently.

Lmsys are launching new training jobs after this patch, please stay tuned.

Nah, I don't use huggingface transformers to run inference with the vicuna model. I use llama.cpp. But I do appreciate the tip.

edit: Oh, I was completely wrong. That fix is in the training, not the inference, so it applies to all the weights.

My point is that I am not aware of any official 4-bit quantized version (delta or weights) from lmsys, so it might be too early to conclude that vicuna fine-tuned llamas lose a lot of performance at 4 bits while others are fine.