Hacker News new | ask | show | jobs
by necroforest 1003 days ago
LLMs are trained in parallel. The model weights and optimizer state are split over a number (possibly thousands) of accelerators.

The main bottleneck to doing distributed training like this is the communication between nodes.