by syntaxing
1174 days ago
I’ve been using the GPTQ 4-bit quantized 13B model with Text generation web UI and it’s been amazing. Probably the closest to ChatGPT I have used so far. I still get an issue where it keeps talking to itself by generating its own prompt and then answering it. Has anyone experienced the same thing?
I am actually getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950X with 64 GB of RAM. 16 threads seems to be the sweet spot: any higher and it slows down, any lower and the time to produce a token is less consistent.
I am 100% convinced that the AI "market" will be a local thing. Running this and having easy access to all the information stored in the weights, without internet, is just so great, I think :)
Edit: the responding-to-itself "bug" is most likely an issue with the prompt you issue. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that where I set the 65B model, change the Moscow stuff to Cairo, and change the Node.js example to a small C program.
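If fixing the prompt is not enough, a common workaround is to cut the generated text at the point where the model starts writing the user's next turn. As far as I know this is what llama.cpp's reverse-prompt option does in interactive mode. Below is a minimal Python sketch of the idea for post-processing output yourself. The function name `truncate_at_stop` and the speaker labels "User:" and "Bob:" are made up for illustration and are not taken from chat-13b.sh or the web UI.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut generated text at the earliest stop sequence, if any occurs.

    Hypothetical helper: the stop sequences are whatever marks the start of
    the user's turn in your own prompt template.
    """
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match at position 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left in front of the removed turn.
    return text[:cut].rstrip()


# Example: the model answers, then invents the next question and answers it too.
generated = (
    "Cairo is the capital of Egypt.\n"
    "User: What about France?\n"
    "Bob: Paris."
)
print(truncate_at_stop(generated, ["\nUser:"]))
# prints: Cairo is the capital of Egypt.
```

This only hides the extra turns after the fact. The model still spends time generating them unless the frontend stops generation when the stop sequence appears.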