HN Mirror

	by thedevindevops 1159 days ago
	ML training is iterative and non-parallelizable, so breaking it up into distributable units of work would not provide any benefit and would actually slow down learning.

3 comments

uptownfunk 1159 days ago

Why does AWS seem to suggest otherwise?

>> With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize AWS compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.

https://aws.amazon.com/sagemaker/distributed-training/

lostdog 1159 days ago

There's still a lot of data that needs to be passed between GPUs. SageMaker is "minimizing the communication," but it's still way more than nothing, and all the gradients need to be communicated roughly every iteration. That's OK to send between computers with high-speed datacenter links, but it's much more than you would ever want to send across the internet repeatedly.
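
To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model size (1 billion parameters), fp32 gradients, and the two link speeds are hypothetical, and the function name `sync_seconds` is made up. It only counts shipping one full copy of the gradients once, ignoring latency, all-reduce overhead, and compression, so treat it as a crude lower bound.

```python
def sync_seconds(num_params, bytes_per_param, link_bits_per_second):
    """Crude lower bound: time to send one full copy of the gradients."""
    gradient_bits = num_params * bytes_per_param * 8
    return gradient_bits / link_bits_per_second


# Hypothetical numbers, chosen only to show the order of magnitude.
PARAMS = 1_000_000_000   # assumed model size
FP32_BYTES = 4           # assumed gradient precision

datacenter = sync_seconds(PARAMS, FP32_BYTES, 100e9)  # assumed 100 Gbit/s link
home = sync_seconds(PARAMS, FP32_BYTES, 100e6)        # assumed 100 Mbit/s link

print(f"datacenter link: {datacenter:.2f} s per sync")
print(f"home connection: {home:.0f} s per sync")
```

Under those assumptions the datacenter link needs about a third of a second per sync, while the home connection needs several minutes, and that cost would be paid roughly every iteration.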

Nevermark 1159 days ago

Some computations work well with large-chunk parallelism, but not fine-grain parallelism.

At some point, distributing the computation further, especially across a large network (the Internet!) of smaller individual machines, means the time cost of combining or synchronizing intermediate calculations is far greater than any benefit of adding another computing node.
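
A toy model makes the trade-off concrete. This is only a sketch under a made-up cost function, not a measurement of any real system: it assumes compute time divides evenly across nodes while synchronization cost grows linearly with each extra node. The names `step_time` and `best_node_count` and all the numbers are hypothetical.

```python
def step_time(work, sync_cost, nodes):
    """Assumed cost model: evenly split compute plus per-extra-node sync cost."""
    return work / nodes + sync_cost * (nodes - 1)


def best_node_count(work, sync_cost, max_nodes):
    """Node count in 1..max_nodes that minimizes the modeled step time."""
    return min(range(1, max_nodes + 1),
               key=lambda n: step_time(work, sync_cost, n))


# Same amount of work, cheap sync (fast link) versus expensive sync (slow link).
fast_link = best_node_count(work=100.0, sync_cost=1.0, max_nodes=64)
slow_link = best_node_count(work=100.0, sync_cost=25.0, max_nodes=64)
print(f"best node count with cheap sync: {fast_link}")
print(f"best node count with expensive sync: {slow_link}")
```

In this model, once synchronization gets expensive the best node count collapses toward a handful, and every node added past that point makes each step slower rather than faster.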

muzani 1159 days ago

TIL Skynet needs deep work and focus time too

j4hdufd8 1158 days ago

Calling ML training non-parallelizable is very strong, no? Matrix multiplication is highly parallelizable.
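
A minimal sketch of why: each block of output rows depends only on the matching rows of the left matrix, so the blocks can be computed independently and stitched back together. The helper names `matmul` and `parallel_matmul` are hypothetical, and the thread pool is there only to show that the pieces are independent. In CPython, threads running pure-Python arithmetic will not actually run faster; real speedups come from GPUs or optimized libraries.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain product of two list-of-lists matrices."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]


def parallel_matmul(a, b, workers=2):
    """Split the rows of `a` into chunks and multiply each chunk on its own."""
    chunk = max(1, -(-len(a) // workers))  # ceiling division
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the row blocks stay in sequence.
        partials = list(pool.map(lambda block: matmul(block, b), blocks))
    return [row for part in partials for row in part]


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
print(parallel_matmul(a, b, workers=2) == matmul(a, b))
```

No worker needs anything from another worker until the final concatenation, which is what makes a single matrix multiply easy to spread across many cores. The sequential dependency in training is between iterations, not inside one.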
