by tyfon
1166 days ago
I've been testing quite a few of these models lately.
For me, the absolute best is still the 65B 4-bit quantized LLaMA model with the correct prompt and parameters, for programming, language, and general questions alike. I am getting about 2 tokens/second with the latest llama.cpp using 16 threads on a 5950X with 64 GB of RAM. 16 threads seems to be the sweet spot: any higher and it slows down, any lower and the time to produce a token is less consistent.

I am 100% convinced that the AI "market" will be a local thing. Running this and having easy access to all the information stored in the weights, without internet, is just so great I think :)

Edit: the responding-to-itself "bug" is most likely an issue with the prompt you give it. The recent llama.cpp has a good starting point in examples/chat-13b.sh. I am using a modified version of that where I set the 65B model and change the Moscow stuff to Cairo and the Node.js to a small C program.
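If you want to find your own thread-count sweet spot, here is a rough sketch of how I'd compare runs. It assumes you have already collected per-token latencies (in seconds) for each thread count yourself; the function names `summarize` and `sweet_spot` and the numbers in `example` are made up for illustration and are not measurements or anything from llama.cpp.

```python
from statistics import mean, stdev


def summarize(latencies_by_threads):
    """Map thread count -> (tokens per second, stdev of per-token latency)."""
    summary = {}
    for threads, latencies in latencies_by_threads.items():
        avg = mean(latencies)
        # stdev needs at least two samples; treat a single sample as zero spread.
        spread = stdev(latencies) if len(latencies) > 1 else 0.0
        summary[threads] = (1.0 / avg, spread)
    return summary


def sweet_spot(latencies_by_threads):
    """Pick the fastest thread count; ties go to the steadier one."""
    summary = summarize(latencies_by_threads)
    return max(summary, key=lambda t: (summary[t][0], -summary[t][1]))


# Hypothetical per-token latencies in seconds, for illustration only.
example = {
    8: [0.70, 0.45, 0.90, 0.55],
    16: [0.50, 0.51, 0.49, 0.50],
    24: [0.60, 0.62, 0.61, 0.59],
}

for threads, (tps, spread) in sorted(summarize(example).items()):
    print(f"{threads:>2} threads: {tps:.2f} tok/s, latency stdev {spread:.3f}s")
print("sweet spot:", sweet_spot(example))
```

Looking at the spread alongside the average matters here, since a lower thread count can look fine on average while being much less consistent from token to token.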
|
BTW, for those interested, here are some notes I'm taking on the nuts and bolts of the local models I'm running (the markdown rendering looks a bit messed up): https://mostlyobvious.org/?link=%2FReference%2FSoftware%2FGe...