| HN Mirror



	by moffkalast 680 days ago
	Well, that, and you also need a fair bit more space for the KV cache, which can be a bit unpredictable. Models without GQA, flash attention, or 4-bit cache support are really terrible in that regard, plus it depends on context length. Haven't found a good rule of thumb for that yet.
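
As a rough starting point, the cache itself can be estimated from the model shape: two tensors (keys and values) per layer, each holding `n_kv_heads * head_dim` values per token. The sketch below assumes a plain dense cache with no paging or padding overhead, ignores the scale/zero-point metadata that quantized caches carry, and does not count attention scratch buffers (the part flash attention helps with). The function name and the model numbers are illustrative, not taken from any particular runtime.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2.0, batch_size=1):
    """Estimate KV cache size in bytes for a dense (unpaged) cache.

    bytes_per_elem: 2.0 for fp16/bf16, 1.0 for 8-bit, 0.5 for 4-bit
    (quantization metadata is ignored, so real usage is a bit higher).
    """
    # Factor of 2: one tensor for keys, one for values, per layer.
    total = 2 * n_layers * n_kv_heads * head_dim * context_len
    return int(total * bytes_per_elem * batch_size)


GIB = 1024 ** 3

# Illustrative 7B-class shape: 32 layers, head_dim 128, 4096-token context.
no_gqa = kv_cache_bytes(32, 32, 128, 4096)             # 32 KV heads, fp16
with_gqa = kv_cache_bytes(32, 8, 128, 4096)            # 8 KV heads (GQA), fp16
q4_cache = kv_cache_bytes(32, 32, 128, 4096, bytes_per_elem=0.5)

print(f"no GQA, fp16:  {no_gqa / GIB:.2f} GiB")
print(f"GQA (8 heads): {with_gqa / GIB:.2f} GiB")
print(f"no GQA, 4-bit: {q4_cache / GIB:.2f} GiB")
```

The estimate scales linearly with context length, which is why long contexts dominate memory on models without GQA: in this example, cutting KV heads from 32 to 8 or dropping the cache to 4-bit each shrinks it by 4x.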