by benchess
1966 days ago
Hi, co-author here! We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees. Kubeflow is interesting, but it solves a slightly different problem: scheduling and coordinating ML workflows on top of Kube. It doesn't get involved in how the processes within an ML job communicate with each other across nodes.

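To make the PyTorch + NCCL + MPI combination concrete, here is a minimal sketch of one common pattern, not necessarily the setup described above: the MPI launcher starts the processes, and each process reads its rank and world size from the launcher's environment. The helper name `mpi_rank_and_size` is hypothetical; the variable names are the ones Open MPI's `mpirun` and MPICH's Hydra launcher export.

```python
import os

# Rank/size variables exported by common MPI launchers:
# Open MPI's mpirun sets the OMPI_* pair, MPICH's Hydra sets the PMI_* pair.
_LAUNCHER_VARS = (
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
    ("PMI_RANK", "PMI_SIZE"),
)


def mpi_rank_and_size(env=None):
    """Return (rank, world_size) read from the MPI launcher's environment.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    for rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    raise RuntimeError("no MPI launcher variables found; start with mpirun/mpiexec")


# Assumed usage inside a training script (needs a PyTorch build with NCCL,
# plus MASTER_ADDR and MASTER_PORT set for the default env:// rendezvous):
#   import torch.distributed as dist
#   rank, world_size = mpi_rank_and_size()
#   dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```

In this pattern MPI only launches and numbers the processes; the GPU-to-GPU collectives go through NCCL once the process group is initialized.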
One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
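
As a rough illustration of the service-based approach, the sketch below builds the DNS name a per-replica headless service would resolve to and derives rendezvous settings from it. The `<job>-<role>-<index>` naming scheme, the function names, and port 29500 are assumptions for the example; only the `<service>.<namespace>.svc.<cluster-domain>` DNS form is standard Kubernetes behavior.

```python
def replica_service_dns(job, role, index, namespace="default",
                        cluster_domain="cluster.local"):
    """DNS name of a per-replica headless Service.

    The "<job>-<role>-<index>" service name is an assumed convention.
    """
    service = f"{job}-{role}-{index}"
    return f"{service}.{namespace}.svc.{cluster_domain}"


def rendezvous_env(job, num_workers, rank, port=29500, namespace="default"):
    """Environment a replica could use to find worker 0 via its Service."""
    return {
        "MASTER_ADDR": replica_service_dns(job, "worker", 0, namespace),
        "MASTER_PORT": str(port),  # assumed port, any free port works
        "WORLD_SIZE": str(num_workers),
        "RANK": str(rank),
    }


if __name__ == "__main__":
    for key, value in rendezvous_env("demo", num_workers=4, rank=2).items():
        print(f"{key}={value}")
```

Here every replica reaches its peers through a stable DNS name rather than a host list handed to an MPI launcher.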