Hacker News
by tyfon 1166 days ago
I've been testing quite a few of these models lately. For me, the absolute best is still the 65B 4-bit quantized llama model with the correct prompt and parameters, whether for programming, language, or general questions.

I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950X with 64 GB of RAM. 16 threads seems to be the sweet spot: any higher and it slows down, any lower and the time to produce a token is less consistent.
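
One possible reading of the 16-thread sweet spot is that the 5950X has 16 physical cores exposed as 32 logical CPUs, so 16 threads puts one thread on each core. A minimal Python sketch of that rule of thumb follows. The helper name `suggested_threads` and the 2-way SMT default are assumptions for illustration, not anything llama.cpp does itself, and the result is only a starting point to benchmark against.

```python
import os


def suggested_threads(logical_cpus: int, smt_ways: int = 2) -> int:
    """Guess a thread count equal to the number of physical cores.

    Assumes every core exposes `smt_ways` logical CPUs (2 on most
    desktop chips with SMT/Hyper-Threading enabled).
    """
    return max(1, logical_cpus // smt_ways)


# os.cpu_count() reports logical CPUs and may return None.
logical = os.cpu_count() or 1
print(f"logical CPUs: {logical}, suggested threads: {suggested_threads(logical)}")
```

On a machine with SMT disabled, pass `smt_ways=1`. Either way, timing a few neighbouring thread counts is the only way to confirm where the sweet spot actually is.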

I am 100% convinced that the AI "market" will be a local thing. Running this and having access to all the information stored in the weights easily and without internet is just so great I think :)

Edit: the responding-to-itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that where I set the 65B model, change the Moscow stuff to Cairo, and change the node.js to a small C program.

4 comments

Sounds like you've had some more success w/ raw LLaMA - would def be interested in how you're prompting it.

BTW, for those interested, here are some notes I'm taking on the nuts and bolts of the local models I'm running (looks like the markdown rendering is a bit messed up): https://mostlyobvious.org/?link=%2FReference%2FSoftware%2FGe...

I didn't see this before today, but I will respond anyway so the siblings can see it too. There is quite a big difference between the 65B model and the smaller ones, especially the 13B and 7B models.

But here is the bash script[0] I use to launch my "go to" AI, it's called Omnius :)

As written in the previous comment, it is a modified version of the examples/chat-13b.sh that is included in the llama.cpp github.

[0] https://pastebin.com/SeKE3Uac

Ah thanks a lot, I tried out the llama.cpp examples before the k-shot chat prompts, this is definitely much better!

I have a 5950X as well, but sadly, token generation is a bit too slow for me now. (I've had turbo turned off for efficiency as well, but maybe I'll see if the extra cycles help.)

I'm giving 30B a try on my GPU now with https://github.com/oobabooga/text-generation-webui/wiki/LLaM... and if it's not good then will give layer offloading with 65B a try and see if I can get it running well.

super helpful, thank you.
How do you talk to llama? It doesn't respond to instructions, so it's a bit complicated to have it extract keywords and/or summarize texts. Can you please share examples of llama prompts?
See my sibling comment for a bash script to launch it :)

Edit: I have not tried to use it for extracting keywords / summarizing text. Perhaps I shall experiment a bit with it.

Ah yes I can see the prompt. It's very verbose. I'll try to figure out how to use a verbose prompt to make llama extract keywords/summarize.
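
Since the raw model continues text rather than following instructions, one common approach is to write the task as a pattern and leave the last slot open. A minimal Python sketch of that idea follows. The function name `build_keyword_prompt`, the "Text:/Keywords:" layout, and the example pairs are all hypothetical, not taken from the pastebin script.

```python
def build_keyword_prompt(examples, text):
    """Frame keyword extraction as a pattern for a base model to continue.

    `examples` is a list of (text, keywords) pairs shown to the model
    before the real input.
    """
    parts = ["Each text below is followed by its keywords."]
    for example_text, keywords in examples:
        parts.append(f"Text: {example_text}\nKeywords: {', '.join(keywords)}")
    # Leave the final "Keywords:" open so the model fills it in.
    parts.append(f"Text: {text}\nKeywords:")
    return "\n\n".join(parts)


# Hypothetical worked examples; replace with ones from your own domain.
EXAMPLES = [
    ("The 65B model needs a lot of RAM to run on a CPU.",
     ["65B model", "RAM", "CPU"]),
    ("Quantizing the weights to 4 bits makes the file much smaller.",
     ["quantization", "4 bits", "file size"]),
]

prompt = build_keyword_prompt(
    EXAMPLES, "Token generation is slow with turbo turned off."
)
print(prompt)
```

The model will usually keep going after the keywords and invent another "Text:" entry, so cut the output at the first newline or at the next "Text:" marker. The same layout should work for summaries by swapping "Keywords:" for "Summary:".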
I really want to try out the 65B model, I hear great things! Sadly none of my computers can handle it. I'm slightly tempted to get more RAM, but I'm on an 8-core i7.
I just bought another 64 GB of RAM for my computer that will arrive in the mail after Easter. That will allow me to run the full 65B FP16 model, however it will probably be much slower than the 4-bit quantized version as it has to do more math.
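
For a rough idea of why the FP16 model needs the extra RAM, here is a back-of-the-envelope Python sketch of the weight sizes. It assumes a nominal 65e9 parameters and counts weights only. Real 4-bit formats also store per-block scale data, so their files come out somewhat larger than this estimate, and the context cache adds more on top.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


N_PARAMS = 65e9  # nominal "65B"; the exact parameter count differs slightly

for label, bits in [("FP16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.0f} GiB")
# prints:
# FP16: ~121 GiB
# 4-bit: ~30 GiB
```

On this estimate the FP16 weights are four times the size of the 4-bit ones and come close to filling 128 GB on their own, so it would be a tight fit.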

My biggest hope in the end is that we will get a library that can utilize the unified memory model of the AMD platform and run these things with a combination of system RAM and GPU. I think Intel also has something similar in their platforms.

Not sure how well llama.cpp runs with 8 cores and such large weights though. With my 5950X I am already pushing the limit of what is usable, speed-wise.

Perhaps we'll even get dedicated AI boards in the end, much like GPUs today.

Could you share your prompts and parameters? I tried it and I didn't really seem to get much better results than ChatGPT or others.