Hacker News
by doppelgunner 311 days ago
Tried running a local LLM and it felt like adopting a pet dragon. Fun at first, but then it keeps eating all my GPU and still refuses to clean up its own context window.
1 comment

Haha, a cute pet dragon. Two knobs that helped me tame VRAM: KV-cache quantization or eviction, and sliding-window attention (if your runtime supports them). What model, runtime, and context length are you running when it tips over? Are you using Ollama?
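
For a rough sense of what those two knobs buy you, here is a back-of-envelope sketch of KV-cache size. The model shape (32 layers, 8 KV heads, head dim 128) and the 4096-token window are illustrative assumptions, not the poster's setup, and the helper name `kv_cache_bytes` is made up for this example. It also ignores quantization scale overhead and assumes every layer uses the window, which not all models do.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2.0, window=None):
    """Approximate KV-cache size in bytes for a single sequence."""
    # With sliding-window attention, only the last `window` tokens stay cached.
    cached_tokens = context_len if window is None else min(context_len, window)
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)

fp16 = kv_cache_bytes(**shape)                      # 16-bit cache, full context
q8 = kv_cache_bytes(**shape, bytes_per_elem=1.0)    # 8-bit quantized cache
swa = kv_cache_bytes(**shape, window=4096)          # 16-bit cache, 4096-token window

print(f"fp16, full context: {fp16 / GIB:.1f} GiB")  # 4.0 GiB
print(f"8-bit, full context: {q8 / GIB:.1f} GiB")   # 2.0 GiB
print(f"fp16, 4096 window:  {swa / GIB:.1f} GiB")   # 0.5 GiB
```

The cache grows linearly with context length, so quantizing halves it per step down in precision, while a sliding window caps it no matter how long the conversation runs. Model weights come on top of these numbers.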