Hacker News new | ask | show | jobs
by sandGorgon 1971 days ago
@benchess - yes this is what i meant. Using the operator framework.

But more generally, MPI over ssh on a large kubernetes deployment is not a very common pattern. Any reason you chose that ?

Have you looked at Ray or Torch-Elastic (which seems to be officially supported by AWS, etc as well) https://github.com/pytorch/elastic ?