Is it possible to use the GPU for this? With an R9 7900X and 32 GB of RAM it takes 15-30 seconds to generate a response. I have a 6900XT, which might be better suited for this.
./server -t 8 -m models/wizardcoder-python-34b-v1.0.Q4_K_S.gguf -c 16384 --mlock -ngl 60
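[Edit: make sure to compile llama.cpp with GPU support first, e.g., "make clean && LLAMA_CUBLAS=1 make -j" for NVIDIA cards. LLAMA_CUBLAS is the CUDA backend, so for an AMD card such as the 6900XT build with LLAMA_HIPBLAS=1 (ROCm) or LLAMA_CLBLAST=1 (OpenCL) instead.]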
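The -ngl value in the command above is the number of model layers offloaded to the GPU, so it only helps up to what fits in VRAM. Below is a rough sketch for picking a starting value. The helper name estimate_ngl and every number passed to it are assumptions for illustration (roughly 19 GB for the model file, 48 layers, 16 GB of VRAM, 4 GB held back for the KV cache and buffers); check the real values in the server's startup log and adjust from there.

```shell
#!/bin/sh
# Hypothetical helper, not part of llama.cpp: estimate a starting -ngl value.
# It assumes every layer takes an equal share of the model file size,
# which is only approximately true.
estimate_ngl() {
  model_mb=$1    # size of the .gguf file in MB (assumption supplied by you)
  n_layers=$2    # number of layers in the model (assumption supplied by you)
  vram_mb=$3     # total VRAM on the card in MB
  reserve_mb=$4  # VRAM held back for KV cache, scratch buffers, desktop

  usable=$((vram_mb - reserve_mb))
  if [ "$usable" -le 0 ]; then
    echo 0
    return
  fi

  # Integer division rounds down, which errs on the safe side.
  fit=$((usable * n_layers / model_mb))
  if [ "$fit" -gt "$n_layers" ]; then
    fit=$n_layers
  fi
  echo "$fit"
}

# Example with the assumed numbers described above.
estimate_ngl 19000 48 16384 4096
```

If the server runs out of memory at the suggested value, lower -ngl by a few layers and try again; a large context such as -c 16384 needs a bigger reserve.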