by gaogao
523 days ago
This approach likely scales to top models. For a while, researchers thought that scaling Hogwild / LocalSGD up to more parameters would hurt convergence, but it seems that things actually converge a bit better with larger models, especially if a MoE split is used.
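For anyone who hasn't seen LocalSGD, here is a rough sketch of the idea on a toy one-parameter regression. Everything in it (`local_sgd`, `make_shards`, the learning rate, the shard sizes) is made up for illustration and is not taken from any real training system. It only shows the mechanics: each worker takes several SGD steps on its own data, then the workers average their weights. It says nothing about how convergence behaves at large scale or with MoE.

```python
import random


def make_shards(true_w=3.0, workers=4, points_per_worker=32, seed=1):
    """Hypothetical toy data: each worker gets its own shard of (x, y) with y = true_w * x."""
    rng = random.Random(seed)
    shards = []
    for _ in range(workers):
        shard = []
        for _ in range(points_per_worker):
            x = rng.uniform(-1.0, 1.0)
            shard.append((x, true_w * x))
        shards.append(shard)
    return shards


def local_sgd(shards, rounds=50, local_steps=8, lr=0.05, seed=0):
    """Fit y = w * x with LocalSGD: local SGD steps per worker, then weight averaging."""
    rng = random.Random(seed)
    w_global = 0.0
    for _ in range(rounds):
        local_weights = []
        for shard in shards:
            w = w_global  # every worker starts the round from the shared weight
            for _ in range(local_steps):
                x, y = rng.choice(shard)
                grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
                w -= lr * grad
            local_weights.append(w)
        # The only communication step: average the workers' weights.
        w_global = sum(local_weights) / len(local_weights)
    return w_global


w = local_sgd(make_shards())
```

The workers are simulated sequentially here. In a real setup they would run in parallel and the averaging line would be the one place they talk to each other, which is why raising `local_steps` cuts communication.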