| HN Mirror



	by GreyOcten 10 hours ago
	handy, but the gap in most of these filters is that "fits in VRAM" doesn't mean usable. the KV cache grows linearly with context length, so a 7B that fits at 2k tokens can OOM at 32k. factoring context length + quant into the estimate is where it'd actually save people from getting burned.
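
To put rough numbers on the context-length point, here is a minimal sketch of the usual KV cache size estimate. The function name and parameters are hypothetical, and the example shape (32 layers, 32 KV heads, head dim 128, fp16 cache) is assumed to match a Llama-2-7B-style model without grouped-query attention; other models will differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading 2 accounts for the separate key and value tensors
    that are cached in every layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


GIB = 1024 ** 3

# Assumed 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
for ctx in (2048, 32768):
    size = kv_cache_bytes(32, 32, 128, ctx)
    print(f"{ctx:>6} tokens: {size / GIB:.1f} GiB")
```

Under those assumptions the cache is about 1 GiB at 2k tokens and about 16 GiB at 32k, on top of the weights. Models with grouped-query attention or a quantized KV cache would come out smaller.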

1 comment

mzubairtahir 4 hours ago

i think you didn't check the app properly. it actually takes the required context window from the user, calculates the KV cache size, and counts that along with the size of the model itself. it also reserves some extra memory to avoid OOM.
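
A check like the one described (weights + KV cache, with some memory held back) could look roughly like the sketch below. This is not the app's actual code: the function names, the 10% reserve, and the weight-size approximation are all assumptions for illustration.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate weight size; bits_per_weight is the effective average
    for the quant format (real formats add some overhead for scales)."""
    return n_params * bits_per_weight / 8


def fits_in_vram(n_params, bits_per_weight, kv_bytes, vram_bytes,
                 reserve_fraction=0.10):
    """Return True if weights + KV cache fit in VRAM after holding back
    a reserve (assumed 10% here) for activations and allocator overhead."""
    usable = vram_bytes * (1 - reserve_fraction)
    needed = weight_bytes(n_params, bits_per_weight) + kv_bytes
    return needed <= usable


# Hypothetical case: 7B params at ~4 bits per weight on an 8 GiB card.
print(fits_in_vram(7e9, 4, kv_bytes=1 * GIB, vram_bytes=8 * GIB))   # small cache
print(fits_in_vram(7e9, 4, kv_bytes=16 * GIB, vram_bytes=8 * GIB))  # large cache
```

The same model passes with a 1 GiB cache and fails with a 16 GiB one, which is why the context window has to be an input to the estimate.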
