by alexeldeib
251 days ago
Larger memory, weaker comms. You can optimize for this by favoring larger batch sizes and data parallelism over sharding schemes that need more communication. Training at scale can't avoid comms entirely, whereas for serving many models fit on a single MI300.
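A rough back-of-the-envelope sketch of the "fits in a single MI300" point, in Python. The 192 GB figure is the published HBM capacity of the MI300X variant; the function names and the 20% overhead for KV cache and activations are my own placeholder assumptions, not measured numbers.

```python
# Assumption: 192 GB is the published HBM capacity of the MI300X variant.
MI300X_HBM_GB = 192


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1e9 params * bytes, divided by 1e9 bytes/GB)."""
    return params_billion * bytes_per_param


def fits_on_one_device(
    params_billion: float,
    bytes_per_param: float,
    overhead_fraction: float = 0.2,  # hypothetical allowance for KV cache, activations
    capacity_gb: float = MI300X_HBM_GB,
) -> bool:
    """Return True if weights plus the assumed overhead fit in one device's memory."""
    needed_gb = weights_gb(params_billion, bytes_per_param) * (1 + overhead_fraction)
    return needed_gb <= capacity_gb


if __name__ == "__main__":
    # (parameters in billions, bytes per parameter)
    for params, nbytes in [(70, 2), (70, 4), (180, 2)]:
        print(params, nbytes, fits_on_one_device(params, nbytes))
```

If the check passes, serving needs no cross-device sharding at all, which is where the weaker interconnect stops mattering. Real headroom depends heavily on context length and batch size, so treat the overhead factor as a knob rather than a constant.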