by sachamorard 75 days ago
The compaction problem described here is worse than it looks because of the asymmetry between the compactor and the reader. The model doing the compaction has full access to everything: it can see all six rules in the policy, the exact budget figure, every constraint. The model reading the summary has no reference point to notice what's missing. There's no checksum on memory.

The article mentions the void between volatile KV cache and permanent weights. One thing that lives in that void: compression results. At Edgee we cache prompt compression outputs in a globally distributed KV store specifically to avoid recomputing them on every request. It maps naturally onto the architecture: the cache is already the right abstraction, and you're just caching one layer higher.
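To make "caching one layer higher" concrete, here's a rough sketch of the idea, not our actual implementation. Everything in it is an assumption for illustration: the `CompressionCache` class, the in-memory dict standing in for a distributed KV store, the whitespace normalization used to build the key, and the `toy_compress` placeholder where a real compressor would go.

```python
import hashlib
from typing import Callable, Dict


class CompressionCache:
    """Content-addressed cache for prompt compression results (illustrative)."""

    def __init__(self, compress: Callable[[str], str]):
        self._compress = compress
        # Assumption: a plain dict stands in for a distributed KV store.
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(prompt: str) -> str:
        # Assumption: collapsing whitespace is a safe normalization here.
        # It would not be safe for whitespace-sensitive prompts such as code.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> str:
        key = self.key_for(prompt)
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: run the expensive compression once and store the result.
        self.misses += 1
        result = self._compress(prompt)
        self._store[key] = result
        return result


def toy_compress(prompt: str) -> str:
    # Hypothetical stand-in for a real compressor: keep the first 8 words.
    return " ".join(prompt.split()[:8])


cache = CompressionCache(toy_compress)
first = cache.get("You are a helpful assistant.  Follow the policy.")
second = cache.get("You are a helpful assistant. Follow the policy.")
```

Because the key is derived from the prompt content rather than from a session, the second call is served from the cache even though it could come from a different request entirely.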

The interesting property is that compression results for similar contexts are often reusable across sessions, which the KV cache itself never is.

The Greg Egan framing is apt. The trajectory from MHA to GQA to MLA reads exactly like a series of decisions about what's worth remembering in full fidelity vs. what can be abstracted. The difference is Egan's citizens chose their own compression ratios.