by avilay 1912 days ago
Cool, thanks for the response. Yes, I do find that the PyTorch tutorials on distributed training are a work-in-progress.

I was thinking of starting with a basic implementation of the original paper by Jeff Dean et al. on synchronized data parallelism, then implementing basic model parallelism, explaining why async parallelism works, doing a simple implementation of HOGWILD!, and finally doing "hello world" training with existing distributed training systems like Horovod, Distributed PyTorch, RayLib, Microsoft DeepSpeed, etc.
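To make the first step concrete, here is a rough sketch of what synchronized data parallelism boils down to, simulated in a single process with plain Python. Everything in it is an assumption for illustration: the names `shard_gradient`, `all_reduce_mean`, and `train_sync` are made up, the "workers" are just a loop, and the model is a one-parameter linear fit. It is not taken from the paper or from any of the libraries listed above.

```python
# Simulated synchronous data parallelism: the "workers" are just a loop,
# and all names here are hypothetical, chosen for illustration only.

def shard_gradient(w, shard):
    """Gradient of mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker would receive this mean."""
    return sum(values) / len(values)

def train_sync(data, num_workers=2, lr=0.01, steps=200):
    # Split the data into equal-sized shards, one per simulated worker.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0  # every replica starts from the same weight
    for _ in range(steps):
        # Each worker computes a gradient on its own shard.
        grads = [shard_gradient(w, s) for s in shards]
        # All replicas apply the same averaged gradient, so they stay in sync.
        w -= lr * all_reduce_mean(grads)
    return w

# Toy data generated from y = 3 * x, so the fitted weight should approach 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = train_sync(data)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, which is why the replicas behave like a single model trained on the whole batch. A real implementation would replace `all_reduce_mean` with actual communication between processes.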

1 comment

"Hello world" examples already exist for all of those. Reproducing them is not very interesting. If you're willing to dive a little deeper, try implementing SyncBatchNorm: explain your design choices, measure the performance impact, and describe any bugs you hit in your implementation. Such a case study would be very interesting to read, and would probably get you noticed.
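For anyone wondering what the core of that exercise looks like, here is a minimal sketch of the statistics step, simulated in one process with plain Python. It assumes each worker shares only a count, a sum, and a sum of squares, which are then added up across workers. The function names are hypothetical, and this is not a description of how `torch.nn.SyncBatchNorm` does it internally.

```python
# Sketch of the statistics step behind synchronized batch norm.
# Workers are simulated as plain lists; all names are hypothetical.

def local_stats(values):
    """What one worker would compute locally: count, sum, sum of squares."""
    return len(values), sum(values), sum(v * v for v in values)

def global_mean_var(per_worker_batches):
    """Combine per-worker stats as a sum all-reduce would, then derive
    the mean and (biased) variance of the whole global batch."""
    stats = [local_stats(b) for b in per_worker_batches]
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean * mean  # E[x^2] - E[x]^2
    return mean, var

def normalize(values, mean, var, eps=1e-5):
    """Normalize one worker's activations with the shared statistics."""
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Two simulated workers with different local batches.
batches = [[1.0, 2.0], [3.0, 6.0]]
mean, var = global_mean_var(batches)
normalized = [normalize(b, mean, var) for b in batches]
```

The `E[x^2] - E[x]^2` form is easy to combine across workers but can lose precision when the mean is large relative to the spread. That kind of trade-off, along with the backward pass (which needs its own cross-worker reduction), is where the design choices and bugs mentioned above tend to show up.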