HN Mirror



	by brianjking 939 days ago
	My problem is estimating how many concurrent users a given GPU can support on average. Working out the VRAM needed just to load the weights is easy enough. But when you allow for something like content generation, where input and output token lengths vary, how do you even begin to work out which GPUs you need?
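One rough, memory-only way to frame it: beyond the weights, every active sequence holds a KV cache that grows with its prompt plus output length. Dividing the VRAM left over after the weights by the KV cache size of one sequence gives an upper bound on concurrency. Below is a minimal Python sketch under stated assumptions: fp16 values (2 bytes), a Llama-2-7B-style config (32 layers, 32 KV heads, head dimension 128, and 7e9 as a round parameter count), and a hypothetical 10% reserve for activations and framework overhead. `ModelConfig` and the function names are illustrative and do not come from any library. The sketch ignores compute throughput and per-user latency, which usually have to be measured by benchmarking.

```python
from dataclasses import dataclass

GIB = 1024**3


@dataclass
class ModelConfig:
    n_params: float           # total parameter count
    n_layers: int
    n_kv_heads: int           # fewer than attention heads if the model uses GQA/MQA
    head_dim: int
    bytes_per_value: int = 2  # assumption: fp16/bf16 for weights and KV cache


def weight_bytes(cfg: ModelConfig) -> float:
    """Memory needed just to hold the weights."""
    return cfg.n_params * cfg.bytes_per_value


def kv_bytes_per_token(cfg: ModelConfig) -> int:
    """One key and one value vector per layer and per KV head, for each token."""
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_value


def max_concurrent_sequences(
    cfg: ModelConfig,
    vram_gib: float,
    avg_prompt_tokens: int,
    avg_output_tokens: int,
    overhead_fraction: float = 0.1,  # hypothetical reserve for activations etc.
) -> int:
    """Rough upper bound on sequences whose KV cache fits alongside the weights."""
    usable = vram_gib * GIB * (1 - overhead_fraction)
    free_for_kv = usable - weight_bytes(cfg)
    if free_for_kv <= 0:
        return 0  # the weights alone do not fit
    tokens_per_seq = avg_prompt_tokens + avg_output_tokens
    return int(free_for_kv // (kv_bytes_per_token(cfg) * tokens_per_seq))


# Assumed Llama-2-7B-style shape, used only as an example.
cfg = ModelConfig(n_params=7e9, n_layers=32, n_kv_heads=32, head_dim=128)
print(kv_bytes_per_token(cfg))                     # 524288 bytes (0.5 MiB) per token
print(max_concurrent_sequences(cfg, 80, 500, 500))  # 120
```

With these assumptions, an 80 GiB card serving 500-token prompts and 500-token outputs fits roughly 120 sequences' worth of KV cache. Substituting the maximum allowed lengths for the averages gives a worst-case figure, which matters if the server reserves cache space per request up front rather than allocating it as tokens are generated.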