|
I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM). I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo, text-davinci-003 to me are about equal and about the cutoff level of where they are generally useful for me - reliability as an end user for summarizations, Q&A, code assistance, etc). I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. pythia, gpt-j, gpt-neox, chatglm and the other open raw models I found to be much worse than what the various eval scores would suggest (PIQA, HellaSwag, WinoGrande, ARC-e, etc)? I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompts (like the one's Scale AI uses for training). I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B load-8-bit on GPU; I tried Baiz but wasn't impressed, have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some of the going off the rails/not stopping - it however strongly leans on "as an AI language model..." responses - IMO, any fine-tune based on ChatGPT output really should filter that out, it really knee caps the responses. One surprise, while it generally doesn't perform quite as well, it tends to be more lucid and in some cases does a significantly better job, is RWKV Raven (ChatRWKV is pretty easy to get going; I can run the v7 14B fw/ fp16int8 in about 16GB of VRAM). The rate of advancement over just a few weeks is really impressive and it's been really fun catching up on the state of the art on LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously) and I'm still learning, but after poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks LLaMA 65B is supposed to outperform Chichilla 70B, Gopher 280B, and even match PaLM 540B - gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters). |