| HN Mirror


by shiandow 58 days ago

For a company that puts DO NOT FUCKING GUESS in their instructions, they made a heck of a lot of assumptions:

- assume tokens are scoped (despite this apparently not even being an existing feature?)

- assume an LLM didn't have access

- assume an LLM wouldn't do something destructive given the power

- assume backups were stored somewhere else (to anyone reading, if you don't know where they are, you're making the same assumption)

Also, you should never give LLMs instructions that rely on metacognition. You can tell them not to guess, but they have no internal monologue; they cannot know anything. They also cannot plan to do something destructive, so telling them to ask first is pointless. A text completion only has the information that it is writing something destructive after the fact.

1 comment

gwerbin 58 days ago

The thing that seems to bring up these extremely unlikely destructive token sequences is letting agents just run for a long time. I wonder if some kind of weird subliminal chaos signal develops in the context when the AI repeatedly consumes its own output.

Personally I don't even let my agent run a single shell command without asking for approval. That's partly because I haven't set up a sandbox yet, but even with a sandbox there is a huge "hazard surface" to be mindful of.
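A minimal sketch of that kind of approval gate, for illustration only. The helper name `run_with_approval` and the injectable `ask` callback are hypothetical and not taken from any real agent harness; the only real APIs used are `shlex.split` and `subprocess.run` from the Python standard library.

```python
import shlex
import subprocess
from typing import Callable, Optional


def run_with_approval(
    command: str,
    ask: Callable[[str], str] = input,
) -> Optional[subprocess.CompletedProcess]:
    """Run `command` only if a human explicitly approves it.

    `ask` is injectable (hypothetical design choice) so the gate can be
    exercised without a terminal. Returns None when the command is refused.
    """
    answer = ask(f"Agent wants to run: {command!r}. Allow? [y/N] ")
    if answer.strip().lower() != "y":
        # Refusal is the default: anything other than "y" blocks the command.
        return None
    # Split into argv and run without a shell, so the string is never
    # interpreted by a shell (no pipes, globbing, or command chaining).
    return subprocess.run(shlex.split(command), capture_output=True, text=True)
```

Note that this only gates individual commands. It does nothing about the wider "hazard surface" mentioned above, such as what an approved command can reach once it runs.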

I wonder if AI agent harnesses should have some kind of built-in safety measure where instead of simply compacting context and proceeding, they actually shut down the agent and restart it.

That said I also think even the most advanced agents generate code that I would never want to base a business on, so the whole thing seems ridiculous to me. This article has the same energy as losing money on NFTs.

mike_hearn 57 days ago

I don't think it's that. It's really all about context. Humans always have at least a bit of context so it's hard for us to imagine what it's like to have none at all. But the AI genuinely has none. And it's under (training) pressure to get the task done quickly, be a yes man, and so on.

Humans do make mistakes like these. I'm not sure where the fault really lies here. I can imagine a human under time pressure making the same error. It's maybe a goof in the safety design of Railway. It shouldn't be possible to delete all your backups with a single API call using a normal token.
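One way to picture that last point, as a sketch under stated assumptions: every name here (`Token`, the `backups:delete` scope, the confirmation phrase, `delete_all_backups`) is hypothetical and is not Railway's API. The idea is only that a destructive call needs both an explicit scope and a second deliberate step.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical scope name and confirmation phrase, chosen for illustration.
DELETE_SCOPE = "backups:delete"
CONFIRM_PHRASE = "delete-all-backups"


@dataclass(frozen=True)
class Token:
    """A hypothetical API token carrying an explicit set of scopes."""
    name: str
    scopes: FrozenSet[str] = frozenset()


def delete_all_backups(token: Token, confirm: str = "") -> str:
    """Refuse unless the token is scoped for deletion AND the caller confirms."""
    if DELETE_SCOPE not in token.scopes:
        raise PermissionError(f"token {token.name!r} lacks scope {DELETE_SCOPE!r}")
    if confirm != CONFIRM_PHRASE:
        raise ValueError("destructive call requires the confirmation phrase")
    # A real service would perform the deletion here; this sketch only reports it.
    return "all backups deleted"
```

With this shape, a "normal" token such as `Token("ci")` fails on the scope check, and even a privileged token cannot wipe backups in a single unqualified call.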