by DARSHANFOFADIYA 118 days ago
I've been working on optimizing training for large long-context models (70B+ parameters) and found that while Tensor Parallelism is well documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes.

I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.
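As a rough illustration of the All-to-All step, here is a single-process NumPy sketch (not the code from the post). It assumes plain non-causal multi-head attention, a sequence length and head count that both divide evenly by the number of ranks, and Python lists standing in for devices. The helper names seq_to_head, head_to_seq and ulysses_attention are made up for this example.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: [seq, heads, dim]; non-causal softmax attention per head.
    scores = np.einsum("shd,thd->hst", q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum("hst,thd->shd", w, v)

def seq_to_head(shards):
    # Simulated all-to-all: P shards of [N/P, H, d] -> P shards of [N, H/P, d].
    # Rank i splits its heads into P groups and "sends" group j to rank j.
    p = len(shards)
    pieces = [np.split(x, p, axis=1) for x in shards]
    return [np.concatenate([pieces[i][j] for i in range(p)], axis=0)
            for j in range(p)]

def head_to_seq(shards):
    # Inverse all-to-all: P shards of [N, H/P, d] -> P shards of [N/P, H, d].
    p = len(shards)
    pieces = [np.split(x, p, axis=0) for x in shards]
    return [np.concatenate([pieces[j][i] for j in range(p)], axis=1)
            for i in range(p)]

def ulysses_attention(q_shards, k_shards, v_shards):
    # 1) all-to-all so every rank sees the full sequence for a subset of heads
    q, k, v = (seq_to_head(t) for t in (q_shards, k_shards, v_shards))
    # 2) ordinary attention, local to each rank
    out = [attention(q[r], k[r], v[r]) for r in range(len(q))]
    # 3) all-to-all back to sequence-sharded layout
    return head_to_seq(out)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, h, d, p = 8, 4, 16, 2
    q, k, v = (rng.standard_normal((n, h, d)) for _ in range(3))
    shards = [np.split(t, p, axis=0) for t in (q, k, v)]
    out = np.concatenate(ulysses_attention(*shards), axis=0)
    print(np.allclose(out, attention(q, k, v)))
```

The point of the layout swap is that attention needs every token for a given head but never mixes heads, so after the first All-to-All each rank can run an unmodified attention kernel on its own head group.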

The post covers:

The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters).

Diagrams of the specific "All-to-All" communication steps.

How to handle the KV-cache bottleneck without exploding memory.
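For the Ring Attention vs. Ulysses item above, a back-of-the-envelope count of elements sent per rank in one attention layer gives a feel for the gap. This is a simplified counting model, not the analysis from the post. It assumes standard multi-head attention (K and V as wide as Q, no GQA), ignores latency, topology and the fact that Ring Attention can overlap its transfers with compute, and uses hypothetical function names.

```python
def ulysses_elems_sent_per_rank(n_tokens, hidden, world):
    # Four all-to-alls (Q, K, V, output). In each one a rank keeps 1/world
    # of its [N/P, h] shard and sends the rest to the other ranks.
    shard = n_tokens * hidden // world
    return 4 * shard * (world - 1) // world

def ring_elems_sent_per_rank(n_tokens, hidden, world):
    # Each rank forwards one K block and one V block ([N/P, h] each)
    # to its neighbour for world - 1 steps.
    shard = n_tokens * hidden // world
    return 2 * shard * (world - 1)

if __name__ == "__main__":
    n_tokens, hidden = 1_048_576, 8192
    for world in (8, 16, 64):
        u = ulysses_elems_sent_per_rank(n_tokens, hidden, world)
        r = ring_elems_sent_per_rank(n_tokens, hidden, world)
        print(world, u, r, r / u)
```

Under these assumptions the ring sends world / 2 times as many elements per rank, so the gap widens as the sequence-parallel group grows. The trade-off is that Ulysses cannot use more ranks than there are attention heads, while the ring has no such limit.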

Happy to answer questions about the implementation or the communication cost analysis!