by georgehotz 655 days ago
That OCP 3.0 card has the same link bandwidth as the GPUs, so you can scale out without losing much all-reduce bandwidth. In practice, for all but the largest models, the ~16 GB/s all-reduce is totally fine. You just need to be able to all-reduce all the weights within your training step time.

Say you are training a 3B parameter model in BF16. That's 6 GB of weights, which takes about 375 ms to all-reduce at ~16 GB/s, so as long as your step time is >= 500 ms you won't see a slowdown.
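
For anyone who wants to redo that arithmetic for another model size, here is a rough sketch. It assumes the ~16 GB/s figure is the effective all-reduce rate, that the all-reduce overlaps with compute, and that 1 GB = 1e9 bytes. The function name and parameters are made up for illustration and are not from any library.

```python
def allreduce_seconds(params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Rough time to all-reduce every weight once, in seconds.

    Assumption: bandwidth_gb_s is the effective all-reduce rate in GB/s
    (1 GB = 1e9 bytes), not the raw per-link bandwidth.
    """
    total_gb = params * bytes_per_param / 1e9
    return total_gb / bandwidth_gb_s


def hides_behind_step(params: float, bytes_per_param: int,
                      bandwidth_gb_s: float, step_seconds: float) -> bool:
    """True if the all-reduce fits inside one training step (assumes full overlap)."""
    return allreduce_seconds(params, bytes_per_param, bandwidth_gb_s) <= step_seconds


# 3B parameters in BF16 (2 bytes each) at ~16 GB/s
t = allreduce_seconds(3e9, 2, 16.0)
print(f"all-reduce time: {t * 1000:.0f} ms")
print("fits in a 500 ms step:", hides_behind_step(3e9, 2, 16.0, 0.5))
```

Swap in a different parameter count or step time to check other cases.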

1 comment

> 3B parameter model

That's tiny. Can it train/fine-tune 70B models?