by agent_invariant 105 days ago

I've been approaching this from a slightly different angle: treating the problem less as "agent alignment" and more as an execution boundary problem.

Instead of trying to force the model to behave via prompts or policies, we assume the model will eventually propose something unsafe. The trick is making sure it can't commit irreversible actions directly.

So the pattern we've been experimenting with is:

1. The agent proposes an action.

2. The proposal goes through a deterministic gate.

3. The gate checks things like replay, state advancement, spend ceilings, etc.

4. Only then does the real-world action execute.
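
A minimal sketch of what such a gate could look like, in Python. Everything here is an assumption for illustration, not taken from a real implementation: the names `Proposal`, `Gate`, `expected_version` and `spend_ceiling_cents` are hypothetical, "state advancement" is modeled as a simple version counter, and state is kept in memory where a real system would presumably need durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass(frozen=True)
class Proposal:
    proposal_id: str       # unique per proposal; used for replay detection
    expected_version: int  # state version the agent reasoned against
    amount_cents: int      # spend requested by this action


@dataclass
class Gate:
    spend_ceiling_cents: int
    version: int = 0
    spent_cents: int = 0
    seen_ids: Set[str] = field(default_factory=set)

    def check(self, p: Proposal) -> Optional[str]:
        """Return a rejection reason, or None if the proposal may commit."""
        if p.proposal_id in self.seen_ids:
            return "replay"
        if p.expected_version != self.version:
            return "stale_state"
        if p.amount_cents < 0:
            return "spend_ceiling"
        if self.spent_cents + p.amount_cents > self.spend_ceiling_cents:
            return "spend_ceiling"
        return None

    def execute(self, p: Proposal, action: Callable[[Proposal], None]) -> str:
        """Run the action only if every check passes."""
        reason = self.check(p)
        if reason is not None:
            return reason
        # Update gate state before the side effect, so a second attempt
        # with the same proposal is rejected as a replay.
        self.seen_ids.add(p.proposal_id)
        self.version += 1
        self.spent_cents += p.amount_cents
        action(p)
        return "committed"
```

The checks are plain comparisons with no model in the loop, so the same proposal against the same gate state always gets the same answer. The agent never holds a reference to `action` itself, only the gate does.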

In practice this looks more like a transaction firewall than a prompt guardrail.

The LLM can reason however it wants, but anything that changes real state (payments, DB writes, API calls) has to pass through the gate.

It doesn't solve the reasoning problem, but it makes the commit boundary deterministic, which removes a lot of the scary failure modes like duplicate actions or retries gone wild.
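
To illustrate the duplicate/retry point: one common way to get that property is an idempotency key at the commit boundary. This is a sketch under that assumption, not necessarily how the gate described above works. `CommitBoundary` and `commit` are hypothetical names, and the result store is an in-memory dict.

```python
from typing import Callable, Dict


class CommitBoundary:
    """Runs each side effect at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def commit(self, key: str, effect: Callable[[], str]) -> str:
        # A retry with a known key gets the stored result back;
        # the side effect is not executed again.
        if key in self._results:
            return self._results[key]
        result = effect()
        # Only successful effects are recorded. If effect() raises,
        # nothing is stored and a later retry will run it again.
        self._results[key] = result
        return result


charges = []


def charge_card() -> str:
    charges.append("charge")
    return "receipt-1"


boundary = CommitBoundary()
# Simulate an agent loop that retries the same action three times.
receipts = [boundary.commit("order-42", charge_card) for _ in range(3)]
```

However many times the agent retries, `charge_card` runs once for the key `"order-42"` and every retry sees the same receipt.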

Still early experiments, but the model behaving badly becomes much less dangerous if it literally can't execute without passing the boundary.