| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by buremba 117 days ago

My take is that agents should only take actions that you can recover from by default. You can gradually give it more permission and build guardrails such as extra LLM auditing, time boxed whitelisted domains etc. That's what I'm experimenting with https://github.com/lobu-ai/lobu

1. Don't let it send emails from your personal account, only let it draft email and share the link with you.

2. Use incremental snapshots and if agent bricks itself (often does with Openclaw if you give it access to change config) just do /revert to last snapshot. I use VolumeSnapshot for lobu.ai.

3. Don't let your agents see any secret. Swap the placeholder secrets at your gateway and put human in the loop for secrets you care about.

4. Don't let your agents have outbound network directly. It should only talk to your proxy which has strict whitelisted domains. There will be cases the agent needs to talk to different domains and I use time-box limits. (Only allow certain domains for current session 5 minutes and at the end of the session look up all the URLs it accessed.) You can also use tool hooks to audit the calls with LLM to make sure that's not triggered via a prompt injection attack.

Last but last least, use proper VMs like Kata Containers and Firecrackers. Not just Docker containers in production.

4 comments

alexhans 117 days ago

That's a decent practice from the lens of reducing blast radius. It becomes harder when you start thinking about unattended systems that don't have you in the loop.

One problem I'm finding discussion about automation or semi-automation in this space is that there's many different use cases for many different people: a software developer deploying an agent in production vs an economist using Claude Vs a scientist throwing a swarm to deal with common ML exploratory tasks.

Many of the recommendations will feel too much or too little complexity for what people need and the fundamentals get lost: intent for design, control, the ability to collaborate if necessary, fast iteration due to an easy feedback loop.

AI Evals, sandboxing, observability seem like 3 key pillars to maintain intent in automation but how to help these different audiences be safely productive while fast and speak the same language when they need to product build together is what is mostly occupying my thoughts (and practical tests).

link

daveguy 117 days ago

Current LLMs are nowhere near qualified to be autonomous without a human in the loop. They just aren't rigorous enough. Especially the "scientist throwing a swarm to deal with common ML exploratory tasks." The judgement of most steps in the exploratory task require human feedback based on the domain of study.

> Many of the recommendations will feel too much or too little complexity for what people need and the fundamentals get lost: intent for design, control, the ability to collaborate if necessary, fast iteration due to an easy feedback loop.

Completely agreed. This is because LLMs are atrocious at judgement and guiding the sequence of exploration is critically dependent on judgement.

link

Doublon 117 days ago

I'd like to try a pattern where agents only have access to read-only tools. They can read you emails, read your notes, read your texts, maybe even browse the internet with only GET requests...

But any action with side-effects ends up in a Tasks list, completely isolated. The agent can't send an email, they don't have such a tool. But they can prepare a reply and put it in the tasks list. Then I proof-read and approve/send myself.

If there anything like that available for *Claws?

link

swid 117 days ago

There is no real such thing as a read only GET request if we are talking about security issues here. Payloads with secrets can still be exfiltrated, and a server you don’t control can do what it wants when it gets the request.

link

zahlman 117 days ago

GET and POST are merely suggestions to the server. A GET request still has query parameters; even if the server is playing by the book, an agent can still end up requesting GET http://angelic-service.example.com/api/v1/innocuous-thing?pa... and now your `dangerous-secret` is in the server logs.

You can try proxying and whitelisting its requests but the properly paranoid option is sneaker-netting necessary information (say, the documentation for libraries; a local package index) to a separate machine.

link

shich 117 days ago

The proxy approach for secret injection is the right mental model, but it only works if the proxy itself is hardened against prompt injection. An agent that can't access secrets directly can still be manipulated into crafting requests that leak data through side channels — URL params, timing, error messages.

The deeper issue: most of these guardrails assume the threat is accidental (agent goes off the rails) rather than adversarial (something in the agent's context is actively trying to manipulate it). Time-boxed domain whitelists help with the latter but the audit loop at session end is still reactive.

The /revert snapshot idea is underrated though. Reversibility should be the first constraint, not an afterthought.

link

buremba 117 days ago

> but it only works if the proxy itself is hardened against prompt injection.

Yes, I'm experimenting using a small model like Haiku to double check if the request looks good. It adds quite a bit of latency but it might be the right approach.

Honestly; it's still pretty much like early days of self driving cars. You can see the car can go without you supervising it but still you need to keep an eye on where it's going.

link

fnord77 117 days ago

> 1. Don't let it send emails from your personal account, only let it draft email and share the link with you.

Right now there's no way to have fine-grained draft/read only perms on most email providers or email clients. If it can read your email it can send email.

> 3. Don't let your agents see any secret. Swap the placeholder secrets at your gateway and put human in the loop for secrets you care about.

harder than you might think. openclaw found my browser cookies. (I ran it on a vm so no serious cookies found, but still)

link

buremba 117 days ago

> Right now there's no way to have fine-grained draft/read only perms on most email providers or email clients. If it can read your email it can send email.

> harder than you might think. openclaw found my browser cookies. (I ran it on a vm so no serious cookies found, but still)

You should never give any secrets to your agents, like your Gmail access tokens. Whenever agents needs to take an action, it should perform the request and your proxy should check if the action is allowed and set the secrets on the fly.

That means agents should not have access to internet without a proxy, which has proper guardrails. Openclaw doesn't have this model unfortunately so I had to build a multi-tenant version of Openclaw with a gateway system to implement these security boundaries.

link

zahlman 117 days ago

> That means agents should not have access to internet without a proxy, which has proper guardrails. Openclaw doesn't have this model unfortunately so I had to build a multi-tenant version of Openclaw with a gateway system to implement these security boundaries.

I wonder how long until we see a startup offering such a proxy as a service.

link

arianvanp 117 days ago

Literally every email client on the planet has supported `mailto:` URIs since basically the existence of the world wide web.

Just generate a mailto Uri with the body set to the draft.

link

zahlman 117 days ago

> harder than you might think. openclaw found my browser cookies. (I ran it on a vm so no serious cookies found, but still)

It's easy, and you did it the right way. Read "don't let your agents see any secret" as "don't put secrets in a filesystem the agents have access to".

link

livestories 117 days ago

I think mailto: links they output (a la

https://mailtolink.me/

) are a great way to get these drafts out even.

link