Hacker News
by silentsvn 100 days ago
> I’m sticking with humans for the moment

Haha, totally get this statement.

The HitL fine-tuning angle is exactly right. The labeled dataset you're building (good/bad/stylistically-wrong memory events) is probably worth more than the compaction itself. Coherence preferences are surprisingly personal — what reads as "not correct based on my style" is hard to spec without examples.

The loop-pruning maps really cleanly to the contradiction detection in our setup. When a model circles the same state N times, it's often because it stored an inconclusive result with the same confidence as a resolved one, so the two look identical at recall time. Tagging memory entries with a status (open, resolved, or contradicted) before they go in cuts a lot of that.
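
Roughly what I mean by status tagging, as a minimal Python sketch. The names here (MemoryEntry, Status, recall) are made up for illustration and are not the actual engram-mcp API. The point is just that the status is set at write time so recall can tell an open question from a settled one:

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"                  # inconclusive, still being investigated
    RESOLVED = "resolved"          # settled result
    CONTRADICTED = "contradicted"  # later evidence disagreed


@dataclass
class MemoryEntry:
    key: str        # hypothetical identifier for the state or question
    content: str
    status: Status  # assigned before the entry is stored


def recall(entries, key):
    """Return entries for `key`: resolved first, contradicted ones dropped."""
    matches = [
        e for e in entries
        if e.key == key and e.status is not Status.CONTRADICTED
    ]
    # False sorts before True, so resolved entries come first (sort is stable).
    return sorted(matches, key=lambda e: e.status is not Status.RESOLVED)
```

With something like this, an agent that hits an open entry knows it is revisiting an unanswered question rather than re-deriving a known result.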

On the autonomy question: we ended up treating certainty as continuous rather than binary. Low-certainty memories stay soft, high-certainty ones get promoted. Automatic compaction only operates on the low end; higher-certainty entries are off-limits without an explicit override. That lets you keep the autonomy without the coherence risk. The failure mode shifts from "deleted something important" to "kept something stale too long," which feels more recoverable.
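
A rough sketch of that gating in Python, under assumptions: the 0.8 threshold, the Memory class, and the is_stale callback are all placeholders I'm inventing here, not values or names from our setup.

```python
from dataclasses import dataclass

PROMOTE_AT = 0.8  # assumed promotion threshold, purely illustrative


@dataclass
class Memory:
    text: str
    certainty: float  # continuous: 0.0 (guess) to 1.0 (confirmed)


def compact(memories, is_stale, override=False):
    """Drop stale memories, touching high-certainty ones only on override."""
    kept = []
    for m in memories:
        protected = m.certainty >= PROMOTE_AT and not override
        if is_stale(m) and not protected:
            continue  # low-certainty and stale: safe to remove automatically
        kept.append(m)
    return kept
```

Calling compact(memories, is_stale) without the override can only ever err toward keeping a stale high-certainty entry, which is the more recoverable failure.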

Would be curious what your pruning signal looks like at the turn level — are you scoring relevance per-turn retroactively, or flagging at write time?

1 comment

Semi-retroactively: my agent has a /compact command, and it's then that I pop the interface. It also opens automatically when the context is full. I've also gone back days later and fed some recorded sessions into it to test things out. Still getting the hang of it, but I won't be surprised to see much bigger teams/companies do something similar (I assume they already are, really).

The /compact trigger is a clean pattern: agent-initiated but human-confirmed. Makes the interface feel more like a review than an interruption.

The retroactive feeding of recorded sessions is underrated. That's basically supervised compaction - you're labeling what mattered in hindsight, which is almost always cleaner signal than in-flight decisions.
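
For what it's worth, here is a minimal sketch of how those hindsight labels could be stored as training data. The field names and the JSONL layout are my assumptions, not anything from your tool; the three labels are the good/bad/stylistically-wrong split mentioned above.

```python
import json
from dataclasses import dataclass, asdict

LABELS = {"good", "bad", "stylistically-wrong"}


@dataclass
class LabeledEvent:
    session_id: str    # hypothetical: which recorded session this came from
    turn: int          # turn index within that session
    memory_event: str  # the compaction decision being judged
    label: str         # one of LABELS, assigned by the human reviewer


def to_jsonl(events):
    """Serialize reviewed events as one JSON object per line."""
    for e in events:
        if e.label not in LABELS:
            raise ValueError(f"unknown label: {e.label}")
    return "\n".join(json.dumps(asdict(e)) for e in events)
```

Keeping the session and turn alongside each label means the same recording can be re-reviewed later without losing the original judgment.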

I suspect the labs are doing something like this at scale, but the hard part is that "what mattered" is user-specific. A generic compaction model trained on aggregate data probably smooths over the individual coherence preferences that make compaction actually useful.

We ended up open-sourcing the memory layer as an MCP server (engram-mcp) if you're interested in how we handled the certainty/recall side.

Interested in what your session recordings look like structurally: are they raw transcripts, or do you extract structure before feeding them in?