by solatic
208 days ago
This. I may not work with AI training workflows, but I struggle to understand why they supposedly require launching a thousand pods per second to use GPUs that fundamentally have to be installed across different bare-metal machines. Once the GPUs are on different machines, and there are 1k+ such machines, just start splitting them across different Kubernetes clusters. Build a scheduling layer above the Kubernetes control planes to decide which cluster each pod gets scheduled onto. The whole thing smells like AI investors throwing money at AI companies, who then go to GCP and tell them to solve the problem at any price, so they can keep scaling without having to build that scheduling layer themselves.
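
To make the "scheduling layer above the clusters" idea concrete, here is a rough sketch of what I mean. Everything in it is hypothetical: the `Cluster` record, `pick_cluster`, and `schedule` are names made up for illustration, the "most free GPUs wins" policy is just one possible choice, and nothing here talks to a real Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # Hypothetical bookkeeping record; a real layer would track far more state.
    name: str
    free_gpus: int  # GPUs not yet claimed by pods on this cluster


def pick_cluster(clusters, gpus_needed):
    """Return the cluster with the most free GPUs that can fit the pod, or None."""
    candidates = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_gpus)


def schedule(clusters, gpus_needed):
    """Reserve GPUs on the chosen cluster and return its name, or None if nothing fits."""
    target = pick_cluster(clusters, gpus_needed)
    if target is None:
        return None
    target.free_gpus -= gpus_needed
    # Assumption: a real layer would now submit the pod to that cluster's own
    # API server, which then does the usual per-node scheduling.
    return target.name
```

The point is that the top layer only has to pick a cluster; each cluster's own control plane still handles node-level placement, so no single control plane has to absorb the full pod churn.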