| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by rushingcreek 652 days ago
	We run the nodes "hot" and close to overload for peak throughput. That's why NVIDIA's XQA innovation was so interesting, because it allows for much higher throughput for a given latency budget: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source.... Serverless would make more sense if we had a significant underutilization problem.