Hacker News
by npsomaratna 1023 days ago
Yes. In the llama.cpp server command, specify the number of layers you'd like offloaded to your GPU via the -ngl parameter, e.g.:

  ./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60
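(You might need to experiment with the number of layers: if you offload more than fits in your GPU's VRAM, loading will fail or run out of memory, so lower -ngl until the model loads.)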
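
If you'd rather start from a rough guess than pure trial and error, here's a minimal sketch of the arithmetic. It assumes all layers are about the same size (model file size divided by layer count) and holds back some VRAM for the context/KV cache and scratch buffers. The function name estimate_ngl, its arguments, and the 2048 MiB default reserve are my own placeholders, not anything llama.cpp provides, and the numbers in the usage line are made up. Treat the result as a first value to try, not a guarantee.

```shell
# estimate_ngl MODEL_SIZE_MIB TOTAL_LAYERS FREE_VRAM_MIB [RESERVE_MIB]
# Hypothetical helper (not part of llama.cpp): prints a starting value for -ngl.
# Assumes every layer takes roughly MODEL_SIZE_MIB / TOTAL_LAYERS of VRAM.
estimate_ngl() {
  model_mib=$1
  total_layers=$2
  free_mib=$3
  reserve_mib=${4:-2048}   # assumed headroom for KV cache and scratch buffers

  usable=$(( free_mib - reserve_mib ))
  if [ "$usable" -le 0 ]; then
    echo 0                 # nothing left after the reserve: keep everything on CPU
    return 0
  fi

  per_layer=$(( model_mib / total_layers ))
  if [ "$per_layer" -le 0 ]; then
    per_layer=1            # guard against division by zero for tiny inputs
  fi

  n=$(( usable / per_layer ))
  if [ "$n" -gt "$total_layers" ]; then
    n=$total_layers        # never suggest more layers than the model has
  fi
  echo "$n"
}

# Example with made-up numbers: 19000 MiB model, 48 layers, 12000 MiB free VRAM
estimate_ngl 19000 48 12000
```

You can read free VRAM from nvidia-smi and the layer count from the model metadata llama.cpp logs at load time, then pass the result as -ngl and adjust down if loading still fails.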

[Edit: make sure to compile llama.cpp with GPU support first, e.g., for NVIDIA GPUs (cuBLAS), "make clean && LLAMA_CUBLAS=1 make -j"]