| HN Mirror



	by Too 339 days ago
	When inference requires maxing out a GPU's memory, where are you planning to keep this cache? Unless there is a way to compress the context into a more manageable snapshot, the cloud provider surely won’t keep a GPU idling just to hold a conversation in memory.