by k4rli 1028 days ago
Thanks, works nicely and easy to set up.

Is it possible to use the GPU for this? With an R9 7900X and 32GB RAM it takes 15-30 sec to generate a response. I have a 6900XT, which might be better suited for this.

1 comments

Yes. In the llama.cpp server command, specify the number of layers you'd like offloaded to your GPU via the -ngl parameter, e.g.:

  ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60
(You might need to play around with the number of layers: lower it if you run out of VRAM, raise it if there is still headroom.)

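[Edit: make sure to compile llama.cpp with GPU support first, e.g., "make clean && LLAMA_CUBLAS=1 make -j". Note that cuBLAS targets NVIDIA cards; for an AMD card like the 6900XT you would want the ROCm/hipBLAS build instead (LLAMA_HIPBLAS=1 in the Makefile at the time of writing, assuming ROCm is installed).]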
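
If you'd rather start from an estimate than pure trial and error, here is a rough sketch for picking an initial -ngl value. It assumes every layer takes about the same share of the model file and that some VRAM has to stay free for the context and scratch buffers. The function name, the reserve_mib default, and the numbers in the example are illustrative guesses, not anything llama.cpp provides or reports.

```python
def estimate_gpu_layers(model_size_mib, n_layers, vram_mib, reserve_mib=1536):
    """Rough starting value for -ngl.

    Assumptions (not from llama.cpp itself):
      - all layers are roughly equal in size, so one layer costs
        model_size_mib / n_layers
      - reserve_mib is a guessed allowance for the KV cache and scratch
        buffers; a large context (-c) needs a larger reserve
    """
    if model_size_mib <= 0 or n_layers <= 0:
        raise ValueError("model_size_mib and n_layers must be positive")

    per_layer_mib = model_size_mib / n_layers
    usable_mib = max(vram_mib - reserve_mib, 0)

    # Never suggest more layers than the model has.
    return min(n_layers, int(usable_mib // per_layer_mib))


if __name__ == "__main__":
    # Hypothetical numbers: a ~19 GiB model file with 48 layers on a 16 GiB card.
    ngl = estimate_gpu_layers(model_size_mib=19000, n_layers=48, vram_mib=16384)
    print(f"try: -ngl {ngl}")
```

Treat the result as a first guess only: if the server fails to allocate memory on load, drop the number by a few layers and try again.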