by ilaksh 911 days ago
Try https://github.com/ggerganov/llama.cpp

It builds very quickly with make. If it's slow when you try it, make sure to enable the CUDA-related build flags and then rebuild.

A key parameter is the one that tells it how many layers to offload to the GPU: -ngl, I think.
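
For reference, here is a minimal Python sketch of passing that parameter when launching llama.cpp from a script. The -m, -ngl, -c and -p flags are ones llama.cpp accepts; the binary name ./main, the model filename, and the layer count of 35 are assumptions for illustration only, so adjust them to your build and model.

```python
import shlex


def llama_cpp_command(model_path, n_gpu_layers, binary="./main",
                      ctx_size=None, prompt=None):
    """Build an argv list for a llama.cpp run with GPU offload.

    `binary` is an assumed name; it depends on how and when you built llama.cpp.
    """
    argv = [binary, "-m", model_path, "-ngl", str(n_gpu_layers)]
    if ctx_size is not None:
        argv += ["-c", str(ctx_size)]  # context length in tokens
    if prompt is not None:
        argv += ["-p", prompt]
    return argv


# Hypothetical values: pick a layer count that fits your GPU memory.
cmd = llama_cpp_command("model.Q4_K_M.gguf", n_gpu_layers=35,
                        ctx_size=4096, prompt="Hello")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd) as-is. If generation is still slow, raising the layer count until GPU memory runs out is the usual way to tune it.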

Also, download the 4-bit GGUF from Hugging Face and try that. It uses much less memory.
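
To put a rough number on "much less memory", here is a back-of-the-envelope sketch in Python. It assumes a round 7 billion parameters and counts only the weights; real 4-bit GGUF files are somewhat larger than this because the quantization formats also store scaling data, and the context cache needs memory on top.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


n_params = 7_000_000_000  # assumed round figure for a "7B" model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit ones, which is the difference between not fitting and fitting on many consumer GPUs.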

1 comment

With llama.cpp and a 12 GB 3060, they can fit an entire Mistral model at Q5_K_M in RAM with the full 32k context. I recommend openhermes-2.5-mistral-7b-16k with USER: / ASSISTANT: instructions. It works surprisingly well for content production (everything except logic and math, say, but those aren't the strong suit of 7B models in general).
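
A minimal sketch of building a USER: / ASSISTANT: style prompt in Python, in case it helps. The exact layout here (one turn per line, a single space after the colon, a trailing "ASSISTANT:" for the model to complete) is my assumption, not something the model requires, so treat it as a starting point.

```python
def format_prompt(history, next_user_message):
    """Render past (user, assistant) turns plus a new user message.

    The prompt ends with a bare "ASSISTANT:" so the model writes the reply.
    """
    lines = []
    for user_text, assistant_text in history:
        lines.append(f"USER: {user_text}")
        lines.append(f"ASSISTANT: {assistant_text}")
    lines.append(f"USER: {next_user_message}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)


# Hypothetical conversation for illustration.
prompt = format_prompt(
    [("Give me a title for a post about GGUF.", "GGUF in Plain English")],
    "Now write a two-sentence intro for it.",
)
print(prompt)
```

The resulting string can be passed to llama.cpp as the prompt. Setting "USER:" as a stop string keeps the model from writing the next user turn itself.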