Hacker News
by cubefox 8 days ago
The problem with "context rot" is that its existence and severity are purely anecdotal. As far as I know, nobody has actually measured context rot systematically. The only thing we know is that memory degrades somewhat in long contexts, via things like needle-in-a-haystack tests. But that's not the same issue. Context rot is usually taken to mean that the model gets dumber even if it doesn't need to remember specific things in its context window.

This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?
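To make that concrete, here is a rough sketch of the comparison. Everything in it is an assumption for illustration: `model` is a hypothetical callable that takes a prompt string and returns an answer string, `tasks` is a list of (question, expected answer) pairs, and exact string match stands in for whatever scoring the real benchmark uses.

```python
from typing import Callable, Sequence

# Hypothetical interface: `model` maps a prompt string to an answer string.
Model = Callable[[str], str]
Task = tuple[str, str]  # (question, expected answer)


def accuracy(model: Model, tasks: Sequence[Task], prefix: str = "") -> float:
    """Fraction of tasks answered correctly when `prefix` is prepended."""
    correct = 0
    for question, expected in tasks:
        answer = model(prefix + question)
        # Exact match is an assumption; a real benchmark has its own scorer.
        correct += answer.strip() == expected
    return correct / len(tasks)


def context_rot_delta(model: Model, tasks: Sequence[Task], filler: str) -> float:
    """Baseline accuracy minus accuracy with the context filled beforehand."""
    baseline = accuracy(model, tasks)
    filled = accuracy(model, tasks, prefix=filler + "\n\n")
    return baseline - filled
```

A positive delta would mean the score dropped once the context was filled. Running it with fillers of increasing length would show how the drop scales, if there is one.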

1 comments

It's pretty hard to measure because most context rot comes from related context, and the model has to figure out which parts are truly relevant, which are relevant but stale, which to ignore, etc.

Each relevant thing is basically a rule. Trying to do something with 500 rules is what's hard.

If you take a standard benchmark and just prepend a random book to it, it will not capture that.

It would still be interesting to see whether performance degrades in that case. Further, many non-agentic benchmarks consist of many short tasks, so one could fill the context with task/response pairs from other tasks (like in a standard chat environment) and then ask the current task at the end. Given that the tasks are probably somewhat similar, context rot should occur.
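A minimal sketch of how such a prompt could be assembled, under stated assumptions: `build_chat_prompt` is a hypothetical helper, the "User:"/"Assistant:" turn format is made up for illustration, and the earlier responses are taken from the benchmark's reference answers rather than from the model's own outputs.

```python
import random
from typing import Sequence

Task = tuple[str, str]  # (question, reference answer)


def build_chat_prompt(tasks: Sequence[Task], target_index: int,
                      n_prior: int, seed: int = 0) -> str:
    """Fill the context with other task/response pairs, then ask the target."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    others = [t for i, t in enumerate(tasks) if i != target_index]
    prior = rng.sample(others, k=min(n_prior, len(others)))
    # The turn format is an assumption, not any particular model's template.
    turns = [f"User: {q}\nAssistant: {a}" for q, a in prior]
    # The target question goes last with its answer left open.
    turns.append(f"User: {tasks[target_index][0]}\nAssistant:")
    return "\n\n".join(turns)
```

Scoring the target task at increasing values of `n_prior` and comparing against `n_prior=0` would give the degradation curve, if any.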