by ClaireGz
108 days ago
This is super helpful. Most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer. Curious from your experiments: at 1M+ context, does communication start to dominate compute? I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up in practice.

Quality degradation as context length increases is a whole other science problem.