| HN Mirror



	by ElectricalUnion 39 days ago
	You need the rest of the RAM for the context. If you don't want to end up with a toy context or a quantized, lossy context, it is pretty easy to end up spending 50+ GB just on the KV cache, per simultaneous inference slot.
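	For a rough sense of where a number like that comes from, here is a back-of-the-envelope sketch. It assumes a plain transformer KV cache (one key and one value vector per layer, per KV head, per token) with no paging or compression. The model shape below (80 layers, 8 KV heads, head dimension 128, fp16 cache, 128k-token context) is a hypothetical example picked for illustration, not a figure from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for ONE inference slot.

    The leading factor of 2 covers the key tensor plus the value tensor.
    bytes_per_elem=2 assumes an unquantized fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape, assumed only for this example.
per_slot = kv_cache_bytes(
    n_layers=80,
    n_kv_heads=8,
    head_dim=128,
    context_len=128 * 1024,
)

slots = 4  # hypothetical number of simultaneous inference slots
print(f"per slot: {per_slot / GIB:.1f} GiB")
print(f"{slots} slots: {slots * per_slot / GIB:.1f} GiB")
```

	Under these assumptions a single slot comes out to 40 GiB, and the total scales linearly with the number of slots and with context length. A model with more KV heads (for example, full multi-head attention instead of grouped-query attention) would need proportionally more.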