by alexeldeib 251 days ago
Larger memory, weaker comms. You can optimize for this by, for example, increasing batch size and leaning on data parallelism rather than sharding schemes that need more comms.
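
To make the batch size point concrete, here is a rough back-of-the-envelope sketch. It assumes plain data parallelism with a ring all-reduce of the full gradient once per step, and ignores overlap, compression, and everything else a real stack does. The function names are made up for illustration.

```python
def allreduce_bytes_per_gpu(num_params, bytes_per_param, num_gpus):
    """Bytes each GPU sends in one ring all-reduce of the full gradient.

    Assumption: standard ring all-reduce, where each GPU sends
    2 * (N - 1) / N times the gradient size.
    """
    if num_gpus < 2:
        return 0.0  # single GPU: nothing to synchronize
    grad_bytes = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes


def comm_bytes_per_sample(num_params, bytes_per_param, num_gpus, per_gpu_batch):
    """Gradient traffic amortized over the samples one GPU processes per step."""
    per_step = allreduce_bytes_per_gpu(num_params, bytes_per_param, num_gpus)
    return per_step / per_gpu_batch


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, 2-byte gradients, 8 GPUs.
    for batch in (4, 8, 16):
        gb = comm_bytes_per_sample(7_000_000_000, 2, 8, batch) / 1e9
        print(f"per-GPU batch {batch}: {gb:.2f} GB sent per sample")
```

The all-reduce volume per step doesn't depend on batch size, so if the memory is there to double the per-GPU batch, the comms cost per sample halves.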

At scale, training can’t avoid comms entirely, whereas for serving many models fit in a single MI300.
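
A quick way to sanity check the "fits in a single MI300" claim for a given model. This is only a sketch: it assumes the 192 GB HBM of the MI300X variant, counts weights only, and reserves a made-up 20% headroom for KV cache and activations. Both the helper name and the headroom figure are my own assumptions; adjust them for your workload.

```python
def fits_on_one_device(num_params, bytes_per_param,
                       device_mem_gb=192.0, headroom=0.2):
    """Return True if the model weights fit in one accelerator's memory.

    Assumptions: device_mem_gb defaults to 192 GB (MI300X HBM capacity),
    and `headroom` is a hypothetical fraction of memory kept free for
    KV cache, activations, and runtime overhead.
    """
    weights_gb = num_params * bytes_per_param / 1e9
    usable_gb = device_mem_gb * (1.0 - headroom)
    return weights_gb <= usable_gb


if __name__ == "__main__":
    # 70B parameters at 2 bytes each is 140 GB of weights.
    print(fits_on_one_device(70e9, 2))                    # 192 GB device
    print(fits_on_one_device(70e9, 2, device_mem_gb=80))  # 80 GB device
```

With these assumptions a 70B model in 16-bit weights fits on one 192 GB device with no sharding, while on an 80 GB device it has to be split, which brings the comms back.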