by cyanf
272 days ago
> On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst impacted hour on August 31, 16% of Sonnet 4 requests were affected.

Interesting. This implies that the 1M context servers perform worse on short-context requests.
Perhaps this is due to some KV cache compression, eviction, or sparse attention scheme being applied on these 1M context servers?

> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
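
To make the arithmetic in that quote concrete, here is a rough sketch of picking the factor. It assumes the native context is 262,144 tokens, which is what the quoted example implies (524,288 tokens gives a factor of 2.0). The helper name `yarn_rope_scaling` is made up for illustration, and the key names in the returned dict are my assumption about what the `rope_scaling` config entry looks like, so check the model card before copying them.

```python
from typing import Optional

# Assumption: native context inferred from the quoted example
# (524,288 tokens -> factor 2.0 implies 262,144 native tokens).
NATIVE_CONTEXT = 262_144


def yarn_rope_scaling(typical_context: int,
                      native_context: int = NATIVE_CONTEXT) -> Optional[dict]:
    """Hypothetical helper: return a rope_scaling entry, or None if not needed."""
    if typical_context <= 0:
        raise ValueError("typical_context must be positive")
    # Static YaRN applies the same factor to every request, so skip it
    # entirely when the workload fits in the native window.
    if typical_context <= native_context:
        return None
    return {
        "rope_type": "yarn",  # assumed key name
        "factor": typical_context / native_context,
        "original_max_position_embeddings": native_context,  # assumed key name
    }


if __name__ == "__main__":
    print(yarn_rope_scaling(32_768))
    print(yarn_rope_scaling(524_288))
```

The point is just that the factor should track the context length you actually serve: a short-context workload gets no scaling at all, rather than inheriting a large static factor tuned for the longest possible input.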