by DARSHANFOFADIYA
118 days ago
I've been working on optimizing training for long-context models (70B+) and found that while Tensor Parallelism is well-documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes. I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.

The post covers:

- The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters).
- Diagrams of the specific "All-to-All" communication steps.
- How to handle the KV-cache bottleneck without exploding memory.

Happy to answer questions about the implementation or the communication cost analysis!
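
For anyone who wants to poke at the All-to-All step before reading the post, here is a rough single-process sketch in NumPy. It is not the DeepSpeed implementation and uses no real communication library: the helper names (`all_to_all`, `seq_to_head_shard`, `head_to_seq_shard`, `ulysses_attention`) are made up for illustration, "ranks" are just list entries, attention is plain non-causal softmax, and it assumes both the sequence length and the head count divide evenly by the number of ranks P.

```python
import numpy as np

def all_to_all(send):
    """Simulated all-to-all. send[src][dst] is the chunk rank src sends to rank dst;
    the result satisfies recv[dst][src] == send[src][dst]."""
    P = len(send)
    return [[send[src][dst] for src in range(P)] for dst in range(P)]

def seq_to_head_shard(local, P):
    """Sequence-sharded [N/P, H, d] per rank -> head-sharded [N, H/P, d] per rank."""
    send = [np.split(x, P, axis=1) for x in local]   # split heads into P groups
    recv = all_to_all(send)
    # Each rank stitches the sequence chunks (in rank order) for its own head group.
    return [np.concatenate(chunks, axis=0) for chunks in recv]

def head_to_seq_shard(local, P):
    """Head-sharded [N, H/P, d] per rank -> sequence-sharded [N/P, H, d] per rank."""
    send = [np.split(x, P, axis=0) for x in local]   # split sequence into P chunks
    recv = all_to_all(send)
    # Each rank stitches the head groups (in rank order) for its own sequence chunk.
    return [np.concatenate(chunks, axis=1) for chunks in recv]

def attention(q, k, v):
    """Plain non-causal softmax attention; q, k, v have shape [N, h, d]."""
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

def ulysses_attention(q, k, v, P):
    """Ulysses-style flow simulated on one process with P pretend ranks."""
    # 1. Start sequence-sharded: each rank holds N/P tokens for all H heads.
    q_s, k_s, v_s = (np.split(t, P, axis=0) for t in (q, k, v))
    # 2. All-to-all: each rank now holds all N tokens for H/P heads.
    q_h, k_h, v_h = (seq_to_head_shard(t, P) for t in (q_s, k_s, v_s))
    # 3. Attention is independent per head, so every rank computes locally.
    out_h = [attention(qr, kr, vr) for qr, kr, vr in zip(q_h, k_h, v_h)]
    # 4. All-to-all back to the sequence-sharded layout.
    out_s = head_to_seq_shard(out_h, P)
    return np.concatenate(out_s, axis=0)  # gathered here only to compare results

rng = np.random.default_rng(0)
N, H, d, P = 16, 4, 8, 4
q, k, v = (rng.standard_normal((N, H, d)) for _ in range(3))
matches = np.allclose(ulysses_attention(q, k, v, P), attention(q, k, v))
print("matches unsharded attention:", matches)
```

The point of the sketch is the layout change: no rank ever needs the full [N, H, d] tensor, because attention only mixes information along the sequence axis within a head. That is why trading the sequence shard for a head shard (and back) is enough.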