by p1esk 1915 days ago
I'd be interested in Distributed Deep Learning with PyTorch, but only if you really know what you're talking about. I wouldn't want you to repeat what is already on pytorch.org on this topic.

Cool, thanks for the response. Yes, I do find that the PyTorch tutorials on distributed training are a work-in-progress.

I was thinking of starting with a basic implementation of synchronized data parallelism from the original paper by Jeff Dean et al., then implementing basic model parallelism, explaining why async parallelism works, doing a simple implementation of HOGWILD!, and finally running "hello world" training on existing distributed training systems like Horovod, Distributed PyTorch, RayLib, Microsoft DeepSpeed, etc.
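To make the first step concrete, here is a rough sketch of what synchronized data parallelism boils down to, simulated in a single process. Everything in it is an assumption for illustration: the "workers" are plain Python lists, `all_reduce_mean` is a hypothetical stand-in for a real collective, and the model is a one-parameter linear fit. It is not taken from the paper or from any library.

```python
# Simulated synchronous data parallelism: no real processes or network,
# just plain Python lists standing in for workers (illustrative assumption).

def local_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Hypothetical stand-in for an all-reduce: everyone gets the average."""
    return sum(values) / len(values)

def train(data, num_workers=2, lr=0.01, steps=200):
    # Split the data into equal shards, one per simulated worker.
    # (Any remainder is dropped to keep the shards the same size.)
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    w = 0.0  # every worker holds an identical replica of the model
    for _ in range(steps):
        # Each worker computes a gradient on its own shard...
        grads = [local_gradient(w, shard) for shard in shards]
        # ...then all replicas apply the same averaged update in lockstep.
        w -= lr * all_reduce_mean(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # y = 3x, 8 samples
w = train(data)
print(w)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so the replicas behave like one model trained on the whole batch. The interesting parts of a real implementation (communication cost, stragglers, overlap with the backward pass) are exactly what this sketch leaves out.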

"Hello world" examples already exist for all of those, and reproducing them is not very interesting. If you're willing to dive a little deeper, try implementing SyncBatchNorm yourself: explain your design choices, measure the performance impact, and describe any bugs you hit along the way. Such a case study would be very interesting to read, and would probably get you noticed.
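For a sense of where the design choices start, here is a minimal single-process sketch of the statistics step, under loud assumptions: lists stand in for GPUs, the summation in `global_mean_var` stands in for an all-reduce, and all function names are hypothetical. It is one possible way to combine per-worker statistics, not a description of how PyTorch does it.

```python
# Simulated cross-worker batch statistics; lists stand in for GPUs
# (illustrative assumption, no real communication happens here).

def local_stats(values):
    """What each worker would contribute: count, sum, and sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)

def global_mean_var(per_worker_values):
    # Summing the three statistics stands in for an all-reduce (sum).
    stats = [local_stats(v) for v in per_worker_values]
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean * mean  # biased variance over the global batch
    return mean, var

def sync_batch_norm(per_worker_values, eps=1e-5):
    # Every worker normalizes its own values with the shared global statistics.
    mean, var = global_mean_var(per_worker_values)
    return [[(v - mean) / (var + eps) ** 0.5 for v in values]
            for values in per_worker_values]

workers = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]]  # uneven per-worker batches
mean, var = global_mean_var(workers)
normalized = sync_batch_norm(workers)
```

Even this toy version raises questions worth writing up: the E[x²] − E[x]² form is cheap to reduce but can lose precision, uneven per-worker batch sizes mean you cannot simply average per-worker means, and the backward pass needs its own reduction.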