| HN Mirror



	by avilay 1959 days ago
	Cool, thanks for the response. Yes, I do find that the PyTorch tutorials on distributed training are a work-in-progress. I was thinking of starting with a basic implementation of the original paper by Jeff Dean et al. on synchronized data parallelism, then implementing basic model parallelism, explaining why async parallelism works, doing a simple implementation of HOGWILD!, and finally doing "hello world" training with existing distributed training systems like Horovod, Distributed PyTorch, RayLib, Microsoft DeepSpeed, etc.
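	For a sense of what the first step might look like, here is a minimal sketch of one synchronous data-parallel update. It is not taken from any paper or library: the workers are simulated as list slices in a single process, the model is a hypothetical 1-D `y = w * x`, and `all_reduce_mean` is a stand-in for a real collective. With equal-sized shards, the averaged per-worker gradient equals the full-batch gradient.

```python
# Sketch only: "workers" are plain list slices in one process, and
# all_reduce_mean stands in for a real collective (an assumption,
# not any library's API).

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Every worker would receive this mean of the local values."""
    return sum(values) / len(values)


def sync_data_parallel_step(w, shards, lr):
    # 1. Each worker computes a gradient on its own shard.
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    # 2. Gradients are averaged across workers (the synchronization point).
    g = all_reduce_mean(local_grads)
    # 3. Every worker applies the same update, so replicas stay identical.
    return w - lr * g


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true weight w = 2
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # two equal-sized shards

w = 0.0
for _ in range(50):
    w = sync_data_parallel_step(w, shards, lr=0.05)
```

	If the shards had unequal sizes, a plain mean of the local gradients would no longer match the full-batch gradient; weighting each gradient by its shard size would restore the equivalence.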

1 comment

p1esk 1959 days ago

"Hello world" examples already exist for all of those. Reproducing them is not very interesting. If you're willing to dive a little deeper, try to implement SyncBatchnorm: explain design choices, measure the performance impact, describe any bugs you had in your implementation. Such a case study would be very interesting to read, and would probably get you noticed.
