by sandGorgon
1971 days ago
@benchess - yes, this is what I meant: using the operator framework. But more generally, MPI over SSH on a large Kubernetes deployment is not a very common pattern. Any reason you chose that? Have you looked at Ray or Torch-Elastic (https://github.com/pytorch/elastic), which seems to be officially supported by AWS and others as well?