| HN Mirror



	by solatic 208 days ago
	This. I may not work with AI training workflows, but I struggle to understand why they supposedly require launching a thousand pods per second to use GPUs that fundamentally have to be installed across different bare-metal machines. Once the GPUs are on different machines, and there are 1k+ such machines, just start putting them in different Kubernetes clusters. Build a scheduling layer above the Kubernetes control planes to decide which cluster each pod gets scheduled onto. The whole thing smells like AI investors throwing money at AI companies, who then go to GCP and tell them to solve the problem at any price, so they can keep scaling without having to build that scheduling layer themselves.
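	To make the "scheduling layer above the control planes" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Cluster` record, the `pick_cluster` and `schedule` helpers, and the "most free GPUs wins" policy are hypothetical and not taken from any real Kubernetes API. A real layer would read capacity from each cluster's API server and then submit the pod there.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical view of one Kubernetes cluster as seen by the meta-scheduler."""
    name: str
    free_gpus: int


def pick_cluster(clusters, gpus_needed):
    """Return the cluster with the most free GPUs that can fit the request, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_gpus)


def schedule(clusters, gpus_needed):
    """Reserve GPUs on the chosen cluster and return its name, or None if nothing fits.

    In a real system this is where the pod would be submitted to the chosen
    cluster's own control plane; here we only update the local bookkeeping.
    """
    chosen = pick_cluster(clusters, gpus_needed)
    if chosen is None:
        return None
    chosen.free_gpus -= gpus_needed
    return chosen.name


if __name__ == "__main__":
    fleet = [Cluster("cluster-a", 4), Cluster("cluster-b", 8)]
    print(schedule(fleet, 8))
    print(schedule(fleet, 4))
    print(schedule(fleet, 1))
```

	The point of the sketch is only that each individual control plane sees a fraction of the pod churn; the placement policy itself can be as simple or as elaborate as the workload needs.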

1 comment

zkmon 207 days ago

Yep, it's just saying "you should now launch 1000s of pods in a single cluster, just because we said it makes sense, and please don't look at the costs, business sense, or operational issues."