by mmq 1966 days ago
OP was probably referring to the MPIOperator, TFOperator, PyTorchOperator, ... They live under the Kubeflow org, but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling, cross-node communication, ...

One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
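
To make the headless-service point concrete, here is a minimal sketch of how per-replica DNS names could be assembled into a cluster spec that each process uses to find its peers. The `<job>-<role>-<index>.<namespace>.svc` naming pattern, the port 2222, and the helper names `replica_hosts` and `cluster_spec` are assumptions for illustration, not something taken from the operators' source.

```python
import json


def replica_hosts(job, role, count, namespace="default", port=2222):
    """Return one DNS address per replica.

    Assumes (hypothetically) that each replica is fronted by its own
    headless service named <job>-<role>-<index> in the given namespace.
    """
    return [f"{job}-{role}-{i}.{namespace}.svc:{port}" for i in range(count)]


def cluster_spec(job, replicas, namespace="default", port=2222):
    """Map each role (e.g. "worker", "ps") to its list of replica addresses."""
    return {
        role: replica_hosts(job, role, count, namespace, port)
        for role, count in replicas.items()
    }


if __name__ == "__main__":
    # Example: a job called "mnist" with two workers and one parameter server.
    print(json.dumps(cluster_spec("mnist", {"worker": 2, "ps": 1}), indent=2))
```

Because every replica gets a stable name, peers can address each other directly through cluster DNS instead of needing an ssh-reachable host list.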

1 comments

@benchess - yes, this is what I meant: using the operator framework.

But more generally, MPI over ssh on a large Kubernetes deployment is not a very common pattern. Any reason you chose that?

Have you looked at Ray or TorchElastic (which seems to be officially supported by AWS, etc. as well)? https://github.com/pytorch/elastic