|
|
|
|
|
by neilmovva
3050 days ago
|
|
It matters when defining parallel work distribution. Unless memory bandwidth is homogeneous across the whole board (i.e. each TPU on a board gets 600 GB/s to its peers), we can't do model parallelism across ASICs efficiently, and must fall back to data parallelism. Which is fine, until you run into limits on maximum batchsize (e.g. up to 8192, as FAIR was able to manage [1] with some tweaks to SGD). [1] https://arxiv.org/abs/1706.02677 |
|