Hacker News
by syntaxing 1174 days ago
I’ve been using the GPTQ 4 bit quantized 13B with Text generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps on talking to itself by generating its own prompt and then answering it. Has anyone experienced the same thing?
6 comments

I've been testing quite a few of these models lately. For me, the absolute best is still the 65B 4-bit quantized LLaMA model with the correct prompt and parameters, for programming, language, and general questions alike.

I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950X with 64 GB of RAM. 16 threads seems to be the sweet spot: any higher and it slows down, any lower and the time to produce a token is less consistent.

I am 100% convinced that the AI "market" will be a local thing. Running this and having access to all the information stored in the weights easily and without internet is just so great I think :)

Edit: the responding to itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that where I set the 65B model, change the Moscow stuff to Cairo, and swap the node.js example for a small C program.
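
To make the idea concrete, here is a minimal Python sketch of a k-shot chat prompt in the spirit of that example script. It is not the script's actual content: the header wording, the speaker labels `User` and `Assistant`, and the `build_chat_prompt` helper are all assumptions for illustration. What it shows is a transcript that ends on an open user label, which is the kind of prompt that pairs with a reverse prompt on that same label.

```python
def build_chat_prompt(examples, user_name="User", ai_name="Assistant"):
    """Build a few-shot chat transcript that ends on an open user turn."""
    # Hypothetical header; the real example script words this differently.
    header = (
        f"Transcript of a dialog where {user_name} talks to an AI "
        f"assistant named {ai_name}.\n\n"
    )
    # Each example pair becomes one completed user/assistant exchange.
    turns = "".join(
        f"{user_name}: {question}\n{ai_name}: {answer}\n"
        for question, answer in examples
    )
    # End on the user's label so the next thing expected is user input.
    return header + turns + f"{user_name}:"


examples = [
    ("What is the capital of Egypt?", "Cairo."),
    ("How do I print a line in C?", "Call puts() from <stdio.h>."),
]
prompt = build_chat_prompt(examples)
print(prompt)
```

The example pairs are placeholders; swapping in exchanges that resemble the questions you actually ask is probably what matters most.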

Sounds like you've had some more success w/ raw LLaMA - would def be interested in how you're prompting it.

BTW, for those interested, here are some notes I'm taking on the nuts and bolts of the local models I'm running (looks like the markdown rendering is a bit messed up): https://mostlyobvious.org/?link=%2FReference%2FSoftware%2FGe...

I didn't see this before today, but I'll respond anyway so the siblings can see it too. There is quite a big difference between the 65B model and the smaller ones, especially the 13B and 7B models.

But here is the bash script[0] I use to launch my "go to" AI; it's called Omnius :)

As written in the previous comment, it is a modified version of the examples/chat-13b.sh that is included in the llama.cpp github.

[0] https://pastebin.com/SeKE3Uac

Ah thanks a lot, I tried out the llama.cpp examples before the k-shot chat prompts, this is definitely much better!

I have a 5950X as well, but sadly, token generation is a bit too slow for me now. (I've had turbo turned off for efficiency as well, but maybe I'll see if the extra cycles help.)

I'm giving 30B a try on my GPU now with https://github.com/oobabooga/text-generation-webui/wiki/LLaM... and if it's not good then will give layer offloading with 65B a try and see if I can get it running well.

super helpful, thank you.
How do you talk to llama? It doesn't respond to instructions, so it's a bit complicated to have it extract keywords and/or summarize texts. Can you please share examples of llama prompts?
See my sibling comment for a bash script to launch it :)

Edit: I have not tried to use it for extracting keywords / summarizing text. Perhaps I shall experiment a bit with it.

Ah yes I can see the prompt. It's very verbose. I'll try to figure out how to use a verbose prompt to make llama extract keywords/summarize.
I really want to try out the 65B model; I hear great things! Sadly none of my computers can handle it. I'm slightly tempted to get more RAM, but I'm on an 8-core i7.
I just bought another 64 GB for my computer that will arrive in the mail after Easter. That will allow me to run the full 65B FP16 model; however, it will probably be much slower than the 4-bit quantized version as it has to do more math.
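
As a rough sanity check on sizes, weight memory can be estimated as parameters times bits per weight divided by 8. The sketch below assumes a round 65e9 parameters and counts weights only; real quantized files also store per-block scales and the runtime needs extra room for context, so actual numbers will be somewhat higher. `weight_memory_gb` is a hypothetical helper, not part of llama.cpp.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 65e9  # assumption: round figure for the "65B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 130.0
q4_gb = weight_memory_gb(N_PARAMS, 4)     # 32.5

print(f"FP16:  ~{fp16_gb:.1f} GB")
print(f"4-bit: ~{q4_gb:.1f} GB")
```

By this estimate the FP16 weights alone come to about 130 GB versus about 32.5 GB at 4 bits, so the FP16 figure is worth comparing against the RAM actually installed.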

My biggest hope in the end is that we will get a library that can utilize the unified memory model of the AMD platform and run these things with a combination of system RAM and GPU. I think Intel also has something similar in their platforms.

Not sure how well llama.cpp runs with 8 cores and such big weights though; with my 5950X I am already pushing the limit of how usable it is, speed-wise.

Perhaps we'll even get dedicated AI boards in the end, much like GPUs today.

Could you share your prompts and parameters? I tried it and I didn't really seem to get much better results than ChatGPT or others.
I'm in the process of testing the various self-hosted LLMs. I have an M2 MBA laptop and a 5950X w/ 64GB RAM and an RTX 4090 (24GB VRAM).

I've used ChatGPT 3.5 and 4 quite a bit, and have done a bunch of comparisons w/ nat.dev's Playground between a variety of models (claude-instant provides gpt-3.5-turbo level output and is about 3-4X faster; gpt-3.5-turbo and text-davinci-003 are about equal to me, and sit right at the cutoff of where models are generally useful for me: reliable enough as an end user for summarizations, Q&A, code assistance, etc).

I found all the raw LLaMA variants I could run (up to 30B) to not be very coherent or useful. I found pythia, gpt-j, gpt-neox, chatglm and the other open raw models to be much worse than what the various eval scores (PIQA, HellaSwag, WinoGrande, ARC-e, etc) would suggest. I did a fair amount of playing w/ inference hyper-parameters early on to no avail, but did not do much k-shot learning or proper prompts (like the ones Scale AI uses for training).

I tried a bunch of other Alpaca/instruction-tuned models and they're better, but IMO still not very good. GPT4All w/ the unfiltered checkpoint was the only one that did OK until I tried Vicuna (13B load-8-bit on GPU; I tried Baize but wasn't impressed, have yet to try Koala, but don't have high expectations). Vicuna does a better job than GPT4All, but I did notice some of the going off the rails/not stopping. It also leans strongly on "as an AI language model..." responses - IMO, any fine-tune based on ChatGPT output really should filter that out, it really kneecaps the responses.

One surprise is RWKV Raven: while it generally doesn't perform quite as well, it tends to be more lucid and in some cases does a significantly better job (ChatRWKV is pretty easy to get going; I can run the v7 14B w/ fp16int8 in about 16GB of VRAM).

The rate of advancement over just a few weeks is really impressive and it's been really fun catching up on the state of the art on LLMs (I wasn't paying much attention before, despite playing around a bunch w/ SD image generation models previously). I'm still learning, but poking around w/ these "smaller" self-hosted models makes me wonder if there's some threshold (50B+ params?) or other secret sauce that captures the "magic" that gpt-3.5 seems to reach (from benchmarks LLaMA 65B is supposed to outperform Chinchilla 70B, Gopher 280B, and even match PaLM 540B - gpt-3.5 is ~175-200B, gpt-4 is estimated at 1T parameters).

I have not experienced that problem, but it sounds both annoying and funny. How often do you encounter it? A few times a day but it varies quite a bit. Have you been able to tell what causes it? Shorter prompts sometimes cause it but I have seen it on longer prompts as well. Is there a new version that fixes the issue? Not that I've seen released but it is worth checking.
It happens randomly, and I tried adjusting the gradio+model settings to match FastChat. I should start taking some screenshots of these because some are funny. I asked how do I update a git repo and it answered correctly with git pull. Then it added HUMAN: how do I delete the whole folder and start over and answered ASSISTANT: try using git reset --hard, if not use rm -rf (paraphrasing here).
That was brilliant. Thanks! You're welcome.
Haven't used the text generation web UI, but if you're using the CLI, use the "reverse prompt" option to hand control back to the user.

  ./bin/main -i --interactive-first -r '### Human:' -t 8 -n 512 --instruct -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin --color
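
For front ends where you post-process the text yourself, the same idea can be approximated by cutting the output at the first reverse prompt. This is a minimal Python sketch of that post-processing, not how llama.cpp implements `-r` (there, control goes back to the user during generation); `cut_at_reverse_prompt` and the sample text are made up for illustration.

```python
def cut_at_reverse_prompt(generated, reverse_prompts):
    """Return (text before the earliest reverse prompt, whether one was found)."""
    cut = len(generated)
    for marker in reverse_prompts:
        idx = generated.find(marker)
        if idx != -1:
            # Keep the earliest match across all markers.
            cut = min(cut, idx)
    return generated[:cut].rstrip(), cut < len(generated)


# Made-up output where the model starts writing the user's next turn.
raw = (
    "Run git pull.\n"
    "### Human: how do I delete the whole folder?\n"
    "### Assistant: rm -rf"
)
reply, stopped = cut_at_reverse_prompt(raw, ["### Human:"])
print(reply)
```

The marker has to match the label the model actually emits, so it is worth checking the exact spelling and casing in the raw output first.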
I got this and after I redownloaded the model/ggml file it was fixed... could be some corruption in the model file?
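
If corruption is the suspect, comparing a checksum before and after re-downloading is a cheap check. A minimal Python sketch, assuming you have a reference SHA-256 from wherever you got the model (none is given here); `sha256_of_file` is a hypothetical helper, and the temp file only stands in for a real model file so the example runs anywhere.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-GB model files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a model file; replace tmp.name with your own path.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_hash = sha256_of_file(tmp.name)
os.remove(tmp.name)

print(demo_hash)
```

Two downloads of the same file should give the same hash; if they differ, at least one of them is damaged.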
I've gotten that with alpaca.cpp. It'll start asking itself questions and then answer them.