Hacker News new | ask | show | jobs
by higheun 97 days ago
Interesting to see this formalized. I've been running controlled experiments on why context separation improves LLM review quality — something I'm calling Cross-Context Review (CCR).

Setup: 30 artifacts (code, docs, scripts), 150 injected errors, 4 review conditions, 360 total reviews using Claude Opus 4.6.

Results:

- Cross-Context Review (artifact only, no production history): F1 28.6%

- Same-session self-review: F1 24.6% (p=0.008 vs CCR)

- Same-session repeated review (SR2): F1 21.7%

The SR2 result is the key finding — reviewing twice in the same session doesn't help (p=0.11 vs single review). The model generates more noise, not more signal. This rules out "two looks are better than one" as an explanation. It's the context separation itself that matters.

The gap is widest on critical errors: 40% detection for CCR vs 29% for same-session review.

Mechanism: production context introduces anchoring bias + sycophancy + context rot. A fresh session eliminates all three simultaneously by removing the conditioning tokens.

What Anthropic is doing here — dispatching independent agents that never saw the production context — is essentially this principle at industrial scale. Working on a paper but not published yet.

1 comments

Update: the paper is now on arXiv — https://arxiv.org/abs/2603.12123

"Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions"