

by sandGorgon 1757 days ago

The hardest part here is horizontal scaling. OpenAI hand-rolled its own MPI+SSH stack (https://openai.com/blog/scaling-kubernetes-to-7500-nodes/).

I wonder what the state of the art is for horizontal scaling here, preferably on Kubernetes.

PyTorch is tricky to integrate (using TorchElastic). You could use Dask or Ray Distributed. TensorFlow has its own mechanism that doesn't play nice with Kubernetes.

How are others doing it?