by necroforest
1003 days ago
LLMs are trained in parallel: the model weights and optimizer state are split across many accelerators, possibly thousands. The main bottleneck in distributed training like this is communication between nodes.
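To put a rough number on the communication cost, here is a back-of-the-envelope sketch. It assumes gradients are synchronized with a ring all-reduce, which is one common scheme and not something stated above; the function names, the 2-byte value size, and the link bandwidth parameter are all hypothetical choices for illustration.

```python
def ring_allreduce_bytes_per_device(num_params, num_devices, bytes_per_value=2):
    """Bytes each device sends in one ring all-reduce over all gradients.

    Assumption: a ring all-reduce, where each device sends
    2 * (N - 1) / N times the payload size (reduce-scatter + all-gather).
    """
    if num_devices < 2:
        return 0.0  # a single device has nothing to exchange
    payload = num_params * bytes_per_value
    return 2 * (num_devices - 1) / num_devices * payload


def comm_seconds(num_params, num_devices, link_gbytes_per_s, bytes_per_value=2):
    """Lower bound on time per sync step, ignoring latency and overlap."""
    sent = ring_allreduce_bytes_per_device(num_params, num_devices, bytes_per_value)
    return sent / (link_gbytes_per_s * 1e9)


if __name__ == "__main__":
    # Hypothetical setup: 7e9 parameters, 1024 devices, 50 GB/s per link.
    print(comm_seconds(7_000_000_000, 1024, 50.0))
```

The per-device traffic barely shrinks as you add devices (the factor approaches 2), so the time spent on communication stays roughly fixed while the compute per device drops. That is one way to see why the interconnect ends up being the bottleneck.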