by andyyyy64 36 days ago
Good catch, that's a real gap. The KV estimate is GQA/MQA-aware (per-model head config) but currently assumes dense full-context attention; it does not model sliding-window / chunked attention, so for SWA models like Mistral or Gemma at long context it over-estimates KV. The error is conservative: it tells you a model needs more than it does, not less, so it won't push you into an OOM. It's still wrong, though. I'll open a tracking issue with per-architecture window sizes; if you have a reference for the exact SWA configs you care about, it'll speed up the fix. This is exactly the kind of report I posted for.
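
To make the size of the gap concrete, here is a minimal sketch of a KV estimate that caps each layer at its attention window. It is an illustration only, not the tool's actual code: `kv_cache_bytes` and `layer_windows` are hypothetical names, and it assumes the runtime really keeps a rolling cache of `window` tokens for sliding-window layers (a runtime that allocates the full context anyway would still match the dense number).

```python
from typing import Optional, Sequence


def kv_cache_bytes(
    context_len: int,
    n_kv_heads: int,
    head_dim: int,
    layer_windows: Sequence[Optional[int]],
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """Estimate KV cache size in bytes for one sequence.

    layer_windows has one entry per layer: None means full attention,
    an int means a sliding window of that many tokens (hypothetical
    representation chosen for this sketch).
    """
    cached_tokens = 0
    for window in layer_windows:
        # A sliding-window layer never needs more than `window` tokens.
        cached_tokens += context_len if window is None else min(context_len, window)
    # Factor 2 covers keys and values.
    return 2 * n_kv_heads * head_dim * bytes_per_elem * cached_tokens


# Mistral 7B shapes: 32 layers, 8 KV heads, head_dim 128, fp16, 32k context.
dense = kv_cache_bytes(32768, 8, 128, [None] * 32)
swa = kv_cache_bytes(32768, 8, 128, [4096] * 32)
print(dense / 2**30, swa / 2**30)  # 4.0 0.5 (GiB)
```

With those shapes the dense assumption reports 4 GiB where a 4096-token window would need 0.5 GiB, so the over-estimate grows linearly with context once it exceeds the window.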

Nice! Mistral 7B v0.1 has sliding_window: 4096 in its HuggingFace config.json (though v0.2 sets it to null). Gemma 2 alternates sliding-window (4096) and full attention every other layer. Both expose the field in the model config, so maybe you could pull it from the same API you're already using.
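
Something like this could turn a parsed config.json into a per-layer window list. It is a rough sketch under stated assumptions: `windows_from_config` is a hypothetical helper, the inline configs are trimmed to the fields used here, and treating even-indexed Gemma 2 layers as the sliding ones is an assumption (for a memory estimate the parity does not matter when the layer count is even).

```python
import json
from typing import List, Optional


def windows_from_config(config: dict) -> List[Optional[int]]:
    """Return one window per layer; None means full attention."""
    n_layers = config["num_hidden_layers"]
    window = config.get("sliding_window")  # missing or null -> no SWA
    if window is None:
        return [None] * n_layers
    if config.get("model_type") == "gemma2":
        # Assumed: sliding and full attention alternate, starting with sliding.
        return [window if i % 2 == 0 else None for i in range(n_layers)]
    return [window] * n_layers


# Trimmed stand-ins for config.json contents (illustrative, not full files).
mistral_v01 = json.loads(
    '{"model_type": "mistral", "num_hidden_layers": 32, "sliding_window": 4096}'
)
mistral_v02 = json.loads(
    '{"model_type": "mistral", "num_hidden_layers": 32, "sliding_window": null}'
)
gemma2_toy = json.loads(
    '{"model_type": "gemma2", "num_hidden_layers": 4, "sliding_window": 4096}'
)

print(windows_from_config(gemma2_toy))  # [4096, None, 4096, None]
```

The null case falls back to full attention on every layer, which keeps the estimate conservative for v0.2-style configs.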