by sandGorgon
1710 days ago
The hardest part here is horizontal scaling. OpenAI hand-rolled its own MPI+SSH stack (https://openai.com/blog/scaling-kubernetes-to-7500-nodes/). I wonder what the state of the art is for horizontal scaling here, preferably on Kubernetes. PyTorch is tricky to integrate (using TorchElastic). You could use Dask or Ray Distributed. TensorFlow has its own mechanism that doesn't play nice with Kubernetes. How are others doing it?