If you want my best guess: I think large context windows cannot be trained properly. There is neither enough material nor enough computing power to train at long context lengths to the same degree as at short ones.
I feel this is a sort of inverse inspection paradox (the paradox that if you sample waiting time in a process, you’re more likely to sample a larger value).
The LLM providers fine-tune the models on some kind of information retrieval task, but to do so you must provide some irrelevant context to bootstrap the session for the long-context tasks.
It would be very easy to do this in ways that train the sequence model to treat early history as noisier than it really is, or to weaken its relationship to late context.
You’re also probably stacking more tasks together in a long context (start with task A, then detour to solving B and C before you can complete A).
The number of training sequences probably falls off superlinearly with length, leaving far fewer samples at long lengths during training.
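To make the inspection paradox mentioned above concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the gaps are drawn from an exponential distribution, and the function name and parameters are made up. It compares the plain average gap with the average gap you observe when you probe the timeline at uniformly random instants.

```python
import bisect
import random


def inspection_paradox(n_intervals=50_000, mean_gap=10.0, n_probes=50_000, seed=0):
    """Return (plain mean gap, mean gap seen by random-time probes).

    Assumption: gaps are i.i.d. exponential with the given mean.
    """
    rng = random.Random(seed)
    gaps = [rng.expovariate(1.0 / mean_gap) for _ in range(n_intervals)]

    # Cumulative end time of each gap, so a probe time can be mapped to a gap.
    ends = []
    t = 0.0
    for g in gaps:
        t += g
        ends.append(t)
    total = t

    observed = []
    for _ in range(n_probes):
        u = rng.uniform(0.0, total)
        # Index of the first gap whose end time lies after the probe.
        i = min(bisect.bisect_right(ends, u), n_intervals - 1)
        observed.append(gaps[i])

    plain_mean = sum(gaps) / len(gaps)
    observed_mean = sum(observed) / len(observed)
    return plain_mean, observed_mean


if __name__ == "__main__":
    plain, observed = inspection_paradox()
    print(f"plain mean gap:    {plain:.2f}")
    print(f"observed mean gap: {observed:.2f}")
```

Long gaps cover more of the timeline, so a random probe is more likely to land in one. For exponential gaps the observed mean comes out at roughly twice the plain mean.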
> (the paradox that if you sample waiting time in a process, you’re more likely to sample a larger value).
The DeepSeek V4 paper discusses one variant of this (related to failures) and how they mitigate it:
> During preemption, we pause the inference engine and save the KV cache of unfinished requests. Upon resumption, we use the persisted WALs and saved KV cache to continue decoding. Even when a fatal hardware error occurs, we can re-run the prefill phase using the persisted tokens in WAL to reconstruct the KV cache.
> Importantly, it is mathematically incorrect to regenerate unfinished requests from scratch, as this introduces length bias. Because shorter responses are more likely to survive interruption, regenerating from scratch makes the model more prone to producing shorter sequences whenever an interruption occurs. If the inference stack is batch-invariant and deterministic, this correctness issue could also be addressed by regenerating with a consistent seed for the pseudorandom number generator used in the sampler. However, this approach still incurs the extra cost of re-running the decoding phase, making it far less efficient than our token-granular WAL method.
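The length bias in that quote can be reproduced with a toy model. This is only a sketch under my own assumptions, not the paper's setup: response lengths are uniform over 1..1000 tokens, preemptions arrive after an exponentially distributed number of decoded tokens, and all names are hypothetical. It compares resuming an interrupted response against resampling it from scratch.

```python
import random


def sample_length(rng, max_len=1000):
    # Hypothetical response-length distribution: uniform over 1..max_len tokens.
    return rng.randint(1, max_len)


def completed_length(rng, mean_uptime, resume):
    """Decode one response under random preemption and return its final length."""
    length = sample_length(rng)
    done = 0  # tokens already decoded for the current response
    while True:
        # Tokens that can be decoded before the next preemption (assumed exponential).
        uptime = rng.expovariate(1.0 / mean_uptime)
        if done + uptime >= length:
            return length
        if resume:
            # Keep the decoded tokens, as a persisted log or cache would allow.
            done += int(uptime)
        else:
            # Regenerate from scratch: a fresh sample replaces the interrupted one.
            length = sample_length(rng)
            done = 0


def mean_completed_length(resume, n=20_000, mean_uptime=500.0, seed=0):
    rng = random.Random(seed)
    total = sum(completed_length(rng, mean_uptime, resume) for _ in range(n))
    return total / n


if __name__ == "__main__":
    print("resume: ", mean_completed_length(resume=True))
    print("restart:", mean_completed_length(resume=False))
```

With resumption, every sampled response eventually completes, so the mean length matches the underlying distribution. With restarts, long responses are more likely to be interrupted and replaced, so the completed responses skew short.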