by Jasssss 39 days ago
The plan command is clever. How do you handle the VRAM estimation for models with sliding window attention vs full context? Something like Mistral at 32k context uses way less KV cache than Llama at the same context length, but from the README it looks like the estimation is based on a fixed context size. Does it account for that?

Good catch, that's a real gap. The KV estimate is GQA/MQA-aware (it uses the per-model head config) but currently assumes dense full-context attention. It does not model sliding-window or chunked attention, so for SWA models like Mistral or Gemma at long context it over-estimates KV. The error is conservative: it tells you a model needs more than it does, not less, so it won't push you into an OOM. It's still wrong, though. I'll open a tracking issue with per-architecture window sizes; if you have a reference for the exact SWA configs you care about, it'll speed up the fix. This is the kind of report I posted for.
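To put rough numbers on the gap, here is a minimal sketch of a GQA-aware KV estimate with an optional window cap. It is not the tool's actual code: `kv_cache_bytes` and its parameters are hypothetical names, the layer and head counts are the ones published in the Mistral 7B config, and it assumes an fp16 cache and a runtime that really does keep only the last `window` tokens per layer (a rolling buffer).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   window=None, bytes_per_elem=2):
    """Rough KV cache size in bytes (hypothetical helper, not the tool's API).

    GQA/MQA-aware because it uses the number of KV heads, not query heads.
    If `window` is set, each layer is assumed to keep at most `window` tokens.
    """
    tokens = context_len if window is None else min(context_len, window)
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * tokens * bytes_per_elem


GIB = 1024 ** 3
# Mistral 7B shape: 32 layers, 8 KV heads, head_dim 128, at 32k context.
dense = kv_cache_bytes(32, 8, 128, 32768)
swa = kv_cache_bytes(32, 8, 128, 32768, window=4096)
print(dense / GIB, swa / GIB)  # 4.0 0.5
```

Under those assumptions the dense estimate is 8x the windowed one at 32k, which is the size of the over-estimate being discussed. If the runtime does not evict tokens outside the window, the dense number is the right one.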
Nice! Mistral 7B v0.1 has sliding_window: 4096 in the HuggingFace config.json (though v0.2 sets it to null). Gemma 2 alternates sliding window (4096) and full attention every other layer. Both have the field in the model config, so maybe you could pull it from the same API you're already using.
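A rough sketch of what pulling that field could look like, given a config.json already parsed into a dict. `per_layer_kv_tokens` and the `alternate` flag are hypothetical names; the field names `sliding_window` and `num_hidden_layers` are the HuggingFace ones. The choice of even-indexed layers as the windowed ones in the alternating case is an assumption to verify against the model implementation, and the 4-layer configs are toy values for illustration only.

```python
def per_layer_kv_tokens(config, context_len, alternate=False):
    """Tokens of K/V each layer would keep, from a HuggingFace-style config dict.

    Hypothetical helper. `alternate=True` models a Gemma-2-style pattern where
    only every other layer uses the window (even indices here, by assumption).
    """
    window = config.get("sliding_window")  # absent or None (JSON null) means dense
    tokens = []
    for layer in range(config["num_hidden_layers"]):
        uses_window = window is not None and (not alternate or layer % 2 == 0)
        tokens.append(min(context_len, window) if uses_window else context_len)
    return tokens


# Toy 4-layer configs, illustration only.
swa_cfg = {"num_hidden_layers": 4, "sliding_window": 4096}
dense_cfg = {"num_hidden_layers": 4, "sliding_window": None}

print(per_layer_kv_tokens(swa_cfg, 32768))                  # [4096, 4096, 4096, 4096]
print(per_layer_kv_tokens(swa_cfg, 32768, alternate=True))  # [4096, 32768, 4096, 32768]
print(per_layer_kv_tokens(dense_cfg, 32768))                # [32768, 32768, 32768, 32768]
```

Summing the list and multiplying by the per-token KV size gives the total. Handling the null case explicitly matters because a v0.2-style config falls back to the dense estimate.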