| HN Mirror



	by bostonsre 1178 days ago
	You don't want idle containers on GPUs. Something like KServe, which sits on top of Knative (similar in spirit to AWS Lambda), is pretty useful because it allows scaling deployments to zero. Requests are buffered in front of the containers, and scaling is based on the number of concurrent requests a container can support. Since almost all of these deployed model inference services are GPU- and CPU-bound, you don't want to route more requests to a container than it can handle, because CPU/GPU contention harms throughput.
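	To make the concurrency-based scaling idea concrete, here is a rough sketch in Python. It is not Knative's or KServe's actual autoscaling algorithm (as far as I know, the real autoscaler averages concurrency over time windows). The function names, the `max_replicas` cap, and the numbers are all made up for illustration.

```python
import math


def desired_replicas(in_flight: int, container_concurrency: int,
                     max_replicas: int = 10) -> int:
    """Hypothetical helper: replicas needed so that no container
    is handed more concurrent requests than it can support."""
    if container_concurrency < 1:
        raise ValueError("container_concurrency must be at least 1")
    if in_flight <= 0:
        # Scale to zero: no traffic means no idle containers holding GPUs.
        return 0
    # Round up so the last partial batch of requests still gets a container,
    # but never exceed the assumed replica cap.
    return min(max_replicas, math.ceil(in_flight / container_concurrency))


def buffered_requests(in_flight: int, container_concurrency: int,
                      replicas: int) -> int:
    """Hypothetical helper: requests that wait in the buffer because
    every running container is already at its concurrency limit."""
    return max(0, in_flight - replicas * container_concurrency)
```

	With a per-container limit of 4, nine in-flight requests would call for three containers and nothing waits. At 100 in-flight requests and a cap of 10 replicas, the excess sits in the buffer instead of being pushed onto containers that are already saturated.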