by xg15 9 days ago
...not to mention, which KV cache? Every attention module has its own, and how many attention modules there are, what inputs they get, how many internal features and attention heads they have, etc., all depend on the architecture of the specific model.
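
To make that concrete, here is a rough sketch of how the cache layout falls out of the architecture. Everything in it is an assumption for illustration: `AttnConfig`, `kv_cache_shapes`, `kv_cache_bytes` and the two toy configs are made-up names and numbers, not any real model or library API, and it assumes every layer has the same shape, which real models don't have to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical architecture description (not a real library type)."""
    n_layers: int    # number of attention modules, each with its own cache
    n_kv_heads: int  # heads whose keys/values get cached
    head_dim: int    # features per head


def kv_cache_shapes(cfg, batch, seq_len):
    """Return one {k, v} shape pair per attention module."""
    shape = (batch, cfg.n_kv_heads, seq_len, cfg.head_dim)
    return [{"k": shape, "v": shape} for _ in range(cfg.n_layers)]


def kv_cache_bytes(cfg, batch, seq_len, bytes_per_elem=2):
    """Total cache size; bytes_per_elem=2 assumes 16-bit values."""
    per_tensor = batch * cfg.n_kv_heads * seq_len * cfg.head_dim
    # Factor 2: one tensor for keys, one for values, in every layer.
    return 2 * cfg.n_layers * per_tensor * bytes_per_elem


# Two made-up architectures fed the same 16-token prompt.
toy_a = AttnConfig(n_layers=4, n_kv_heads=8, head_dim=64)
toy_b = AttnConfig(n_layers=12, n_kv_heads=2, head_dim=128)

for name, cfg in [("toy_a", toy_a), ("toy_b", toy_b)]:
    shapes = kv_cache_shapes(cfg, batch=1, seq_len=16)
    print(name, len(shapes), "caches of shape", shapes[0]["k"],
          "total bytes:", kv_cache_bytes(cfg, batch=1, seq_len=16))
```

Same prompt, but the two configs end up with a different number of caches, different tensor shapes and a different total size, so "the KV cache" isn't one well-defined object you can talk about without naming the model.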