by mg
221 days ago
If a model is not making use of the whole context window, shouldn't that be very noticeable when the prompt is code? For example, when asking a model to refactor a piece of code, would that really work if it forgot about one part of the code while refactoring another? I concatenate a lot of code files into a single prompt multiple times a day and ask LLMs to refactor them, implement features, or review the code. So far, I have never had the impression that filling the context window with a lot of code causes problems. I also put very long lists of code-style instructions at the top of my prompts, and the LLMs seem to follow all of them just fine.
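For readers who want to try the same workflow, here is a minimal sketch of concatenating files into one prompt. It is not the commenter's actual tooling: the function names, the `--- name ---` header format, and the 4-characters-per-token estimate are all assumptions made for illustration.

```python
from pathlib import Path


def read_files(paths: list[str]) -> dict[str, str]:
    """Read each path into a {path: source} mapping (hypothetical helper)."""
    return {p: Path(p).read_text(encoding="utf-8") for p in paths}


def build_prompt(instructions: str, files: dict[str, str]) -> str:
    """Join the instructions and the named source files into one prompt string."""
    parts = [instructions.strip()]
    for name, source in files.items():
        # A header line per file lets the model tell the files apart.
        # The exact header format is an arbitrary choice.
        parts.append(f"--- {name} ---\n{source.rstrip()}")
    return "\n\n".join(parts)


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size estimate. Assumption: about 4 characters per token,
    which varies by tokenizer and by language; use a real tokenizer
    when the number matters."""
    return int(len(text) / chars_per_token)
```

Comparing `rough_token_count(prompt)` against the context lengths at which a model's accuracy is reported to drop gives a rough idea of whether a given prompt is in the range where degradation might show up.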
|
https://wandb.ai/byyoung3/ruler_eval/reports/How-to-evaluate...
>Gpt-5-mini records 0.87 overall judge accuracy at 4k [context] and falls to 0.59 at 128k.
And Llama 4 Scout claimed a 10-million-token context window, but in practice its accuracy on query tasks drops below 20% by 32k tokens.