by rcxdude
762 days ago
It's not a task that benefits much from being divided into lots of small work units that get processed in parallel with little communication between the nodes. It's naturally almost the complete opposite: it wants very high bandwidth between all the compute units, because each training iteration computes the gradient for every weight in the network and then updates all of them, so every node needs the results from every other node before the next iteration can start. Splitting it up over slow links only slows it down: even if you distributed training across 10x the compute nodes, each of which was 10x faster, you're gonna lose out if the bandwidth between them drops to even 1/2, because the time spent exchanging updates ends up dominating. This is why all the really big models need a lot of very tightly integrated hardware.
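To put rough numbers on that, here is a toy back-of-the-envelope model in Python. It is only a sketch: `step_time` is a made-up helper, the figures are hypothetical, and it assumes each step is simply compute time plus the time to sync gradients, with no overlap between the two and a baseline where the two are already about equal.

```python
def step_time(compute_seconds, sync_bytes, bandwidth_bytes_per_s, compute_speedup=1.0):
    """Toy model: one training step = compute time + gradient sync time.

    Assumes compute and communication do not overlap (a simplification).
    """
    compute = compute_seconds / compute_speedup
    sync = sync_bytes / bandwidth_bytes_per_s
    return compute + sync


# Hypothetical tightly integrated cluster: 0.1 s of compute per step,
# 10 GB of gradients to exchange over a 100 GB/s interconnect.
baseline = step_time(
    compute_seconds=0.1,
    sync_bytes=10e9,
    bandwidth_bytes_per_s=100e9,
)

# Hypothetical distributed setup: 10x the nodes, each 10x faster
# (100x the compute overall), but only half the bandwidth.
distributed = step_time(
    compute_seconds=0.1,
    sync_bytes=10e9,
    bandwidth_bytes_per_s=50e9,
    compute_speedup=100.0,
)

print(f"baseline step:    {baseline:.3f} s")
print(f"distributed step: {distributed:.3f} s")
print("distributed is slower" if distributed > baseline else "distributed is faster")
```

With these made-up numbers the 100x compute advantage is wiped out by the halved bandwidth, since the sync time doubles while the compute time was only half of the step to begin with. If the baseline were heavily compute-bound instead, the result would flip, so the conclusion depends on the step already being limited by communication.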