by greyskull
64 days ago
Thanks! Are the things you're mentioning, like "You may be able to offload some layers to GPU..." and "You can keep the KV cache on GPU...", configured as part of llama.cpp? I wouldn't know what to prompt with or how to evaluate "correctness" (outside of literally feeding your comment into Claude and seeing what happens). Aside: what is your tooling setup? Which harness are you using (if any), what's running the inference and where, what runs in WSL vs. Windows, etc.? I struggle to even ask the right questions about the workflow and environment.
I've had the best results with opencode. I'm running locally with 64GB RAM and a Radeon RX 9070 XT (16GB). NVIDIA should be easier (CUDA). I'm on Linux full time now, but I used to use WSL2 all the time and had all of this working in it.
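
On the llama.cpp question: as far as I know, layer offload and KV cache placement are plain command-line options on llama.cpp's own binaries, not something you prompt for. Here is a rough sketch of how a launch command could be assembled. The helper name `build_llama_server_cmd`, the model path, and the layer count are made up for illustration; the flags (`-m`, `-ngl`, `-c`, `--no-kv-offload`) are the ones I believe `llama-server` accepts, so check them against `llama-server --help` on your build.

```python
import shlex


def build_llama_server_cmd(model_path, gpu_layers, ctx_size=8192, kv_on_gpu=True):
    """Build an argv list for llama.cpp's llama-server.

    Hypothetical helper for illustration; it only assembles arguments
    and does not launch anything.
    """
    cmd = [
        "llama-server",
        "-m", model_path,          # path to a GGUF model file (assumed)
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
        "-c", str(ctx_size),       # context window size in tokens
    ]
    if not kv_on_gpu:
        # KV cache offload is on by default; this flag keeps it in system RAM.
        cmd.append("--no-kv-offload")
    return cmd


if __name__ == "__main__":
    # Example values only: tune gpu_layers until the model fits in VRAM.
    cmd = build_llama_server_cmd("models/example.gguf", gpu_layers=20)
    print(shlex.join(cmd))
```

With a 16GB card the usual approach is to raise `-ngl` until you run out of VRAM, then back off a little. To actually start the server you could pass the list to `subprocess.run(cmd)`.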