No single paper nails that exact claim. SWE-bench (from Princeton) does show that models struggle significantly with real-world issues requiring changes across multiple files and functions, which points in that direction. But the local vs. global framing is mostly practitioner-observed, not a formally tested hypothesis yet. Fair point, I should have hedged it. https://arxiv.org/abs/2310.06770