by uptownfunk 1159 days ago
Why does AWS seem to suggest otherwise?

>> With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize AWS compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.

https://aws.amazon.com/sagemaker/distributed-training/

2 comments

There's still a lot of data that needs to be passed between GPUs. SageMaker is "minimizing the communication," but that's still far more than nothing: all the gradients need to be communicated roughly every iteration. That's OK to send between machines over high-speed datacenter links, but far more than you would ever want to send across the internet repeatedly.
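
To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption picked for illustration, not a SageMaker figure: a 1-billion-parameter model, 4-byte (fp32) gradients, a 100 Gbit/s datacenter link and a 100 Mbit/s home connection. It also ignores latency, the actual all-reduce algorithm, and gradient compression, so treat it as a naive lower bound on pushing one full copy of the gradients over one link.

```python
def gradient_sync_seconds(num_params: int, bytes_per_param: int, link_bits_per_sec: float) -> float:
    """Naive time to push one full set of gradients over a single link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / link_bits_per_sec


# All values below are illustrative assumptions, not measurements.
NUM_PARAMS = 1_000_000_000   # hypothetical 1B-parameter model
BYTES_PER_PARAM = 4          # fp32 gradients
DATACENTER_LINK = 100e9      # 100 Gbit/s
HOME_LINK = 100e6            # 100 Mbit/s

datacenter_s = gradient_sync_seconds(NUM_PARAMS, BYTES_PER_PARAM, DATACENTER_LINK)
home_s = gradient_sync_seconds(NUM_PARAMS, BYTES_PER_PARAM, HOME_LINK)

print(f"datacenter link: {datacenter_s:.2f} s per iteration")
print(f"home connection: {home_s:.0f} s per iteration")
```

Under those assumptions the home connection is slower by the ratio of the two bandwidths (1000x), and that cost is paid on roughly every iteration.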
Some computations work well with large-chunk (coarse-grained) parallelism, but not with fine-grained parallelism.

At some point, distributing the computation further, especially across a large distributed network (the Internet!) and among smaller individual computing systems, means the time cost of combining or synchronizing intermediate calculations far exceeds any benefit of adding another computing node.
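
A toy model of that tradeoff, as a sketch only. It assumes the compute for one step splits perfectly across n nodes, and that synchronization cost grows linearly with the number of extra nodes. Both are simplifications I'm making for illustration (real collectives scale differently), and the function names and numbers are hypothetical.

```python
def step_time(compute_s: float, sync_s_per_node: float, n: int) -> float:
    """Time for one step on n nodes: perfectly split compute plus sync overhead."""
    return compute_s / n + sync_s_per_node * (n - 1)


def best_node_count(compute_s: float, sync_s_per_node: float, max_nodes: int) -> int:
    """Node count (1..max_nodes) that minimizes step time under this toy model."""
    return min(range(1, max_nodes + 1),
               key=lambda n: step_time(compute_s, sync_s_per_node, n))


# Illustrative assumptions: 1 s of compute per step on a single node.
fast_link = best_node_count(compute_s=1.0, sync_s_per_node=0.01, max_nodes=64)
slow_link = best_node_count(compute_s=1.0, sync_s_per_node=1.0, max_nodes=64)

print("cheap sync (datacenter-like):", fast_link, "nodes")
print("expensive sync (internet-like):", slow_link, "node(s)")
```

In this model the useful node count is roughly sqrt(compute_s / sync_s_per_node). Once sync per node costs about as much as the whole step's compute, a single machine wins, which is the point above about adding nodes over the Internet.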