| HN Mirror

	by petuman 358 days ago
	> assuming I need 1k tokens/second throughput (on each, so 20 x 1k)

	3.6B activated parameters at Q8 (about 1 byte each) x 1000 t/s = 3.6 TB/s just for activated model weights (there's also context). So pretty much straight to B200 and the like. 1000 t/s per user/agent is way too fast; make it 300 t/s (about 1.08 TB/s) and you could get away with a 5090/RTX PRO 6000.
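
	For anyone who wants to redo the arithmetic, here is a rough sketch. It assumes each generated token reads every activated weight once (a single stream, no batching), that Q8 is about 1 byte per parameter, and it ignores context/KV-cache traffic. The function name and the GPU bandwidth figures are my own additions: the figures are approximate published peak numbers, not measurements, so treat them as assumptions.

```python
# Back-of-envelope memory bandwidth needed just to stream the activated weights.
# Assumption: every generated token reads all activated parameters once.

def weight_bandwidth_tb_s(active_params_billion: float,
                          bytes_per_param: float,
                          tokens_per_s: float) -> float:
    """Weight bytes read per second, in TB/s (1 TB = 1e12 bytes)."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bytes_per_token * tokens_per_s / 1e12

ACTIVE_PARAMS_B = 3.6   # activated parameters, in billions (from the comment)
Q8_BYTES = 1.0          # assumption: Q8 is roughly 1 byte per parameter

# Assumption: approximate peak memory bandwidth in TB/s, from vendor spec sheets.
APPROX_PEAK_TB_S = {
    "RTX 5090": 1.8,
    "RTX PRO 6000": 1.8,
    "B200": 8.0,
}

if __name__ == "__main__":
    for tps in (1000, 300):
        need = weight_bandwidth_tb_s(ACTIVE_PARAMS_B, Q8_BYTES, tps)
        fits = [gpu for gpu, bw in APPROX_PEAK_TB_S.items() if bw >= need]
        print(f"{tps} t/s needs about {need:.2f} TB/s; peak bandwidth suffices on: {fits}")
```

	Real throughput sits below peak bandwidth, so the "suffices" list is an upper bound on what could work, not a guarantee.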