Hacker News
by greyskull 64 days ago
Thanks! Are the things you're mentioning, like "You may be able to offload some layers to GPU..." and "You can keep the KV cache on GPU...", configured as part of llama.cpp? I wouldn't know what to prompt with or how to evaluate "correctness" (outside of literally feeding your comment into Claude and seeing what happens).

Aside: what is your tooling setup? Which harness are you using (if any), what's running the inference and where, what runs in WSL vs. Windows, etc.?

I struggle to even ask the right questions about the workflow and environment.

1 comment

Yes, fair enough, but try feeding my comment in :). It should be enough for it to go on. Then ask it to explain the concepts I mentioned and to suggest follow-up questions so you can learn more about llama.cpp/local inference!

I've had the best results with opencode. I'm running locally with 64GB RAM and a Radeon 9070XT (16GB). NVIDIA should be easier (CUDA). I'm on Linux full time now, but I used to use WSL2 all the time and had all this working in it.
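
To make the "is it configured as part of llama.cpp?" question a bit more concrete: as far as I know, layer offloading and KV cache placement are both plain command-line options on llama.cpp's llama-server. Here's a rough Python sketch that only builds the command line. The helper name, the model path, and the numbers (20 layers, 8192 context) are made up for illustration, so tune them to your own VRAM.

```python
import shlex


def llama_server_command(model_path, gpu_layers, ctx_size=8192, kv_cache_on_gpu=True):
    """Build an argv list for llama.cpp's llama-server.

    Hypothetical helper for illustration; it does not launch anything.
    """
    cmd = [
        "llama-server",
        "-m", model_path,                    # path to a GGUF model file (placeholder)
        "--n-gpu-layers", str(gpu_layers),   # how many layers to offload to the GPU
        "--ctx-size", str(ctx_size),         # context length; larger means a bigger KV cache
    ]
    if not kv_cache_on_gpu:
        # Keep the KV cache in system RAM instead of VRAM.
        cmd.append("--no-kv-offload")
    return cmd


# Example values are assumptions, not a recommendation for any specific card.
cmd = llama_server_command("models/example.gguf", gpu_layers=20)
print(shlex.join(cmd))
```

If the model doesn't fit in VRAM, lowering gpu_layers is the usual first knob to turn; passing kv_cache_on_gpu=False trades speed for VRAM headroom.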