by kijin 17 days ago
So we still don't have a reliable way to separate instructions from data when talking to an LLM, a problem that humans learned how to solve decades ago in areas like SQL and memory safety. But hey, we have these hopefully-not-leaky containers, which are probably implemented with just more system prompts.

How long until somebody figures out how to trick Codex into disabling Lockdown Mode for you?
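For anyone who hasn't seen the SQL half of that comparison spelled out, here's a minimal sketch using Python's built-in `sqlite3`. The `users` table and the `find_user` helper are made up for illustration. The point is only that the query text and the user-supplied value travel separately, so the value is never parsed as an instruction.

```python
import sqlite3

def find_user(conn, name):
    # The SQL text is fixed; `name` is passed as a bound parameter,
    # so the database never parses it as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

hostile = "alice' OR '1'='1"

# Parameterized: the hostile string is treated purely as data.
safe_rows = find_user(conn, hostile)

# Concatenated: the same string rewrites the query and matches every row.
unsafe_rows = conn.execute(
    "SELECT id, name FROM users WHERE name = '" + hostile + "'"
).fetchall()
```

There is no equivalent of the `?` placeholder for a prompt today, which is the gap being complained about: everything the model reads arrives in the same channel.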

2 comments

> So we still don't have a reliable way to separate instructions from data when talking to an LLM

Humans also do not know how to do this reliably, which is why phishing is still a thing and always will be.

I think the Stroop effect ("read these colour names, each written in a different colour") is probably the purest demonstration of this. Humans are trivially prompt-injectable.

> Humans also do not know how to do this reliably

These are machines, not humans, so I don't understand the comparison. The point of tech advancement is that we eliminate entire classes of errors that humans make. You'd probably look at me funny if I wrote a production application that failed randomly in unexpected ways like corrupting data, opening security holes, etc. then explained it away with "well, humans do it too!"

It's an artificial intelligence, not a small deterministic shell script. Stop comparing it to one. It has both new capabilities and new classes of failure mode. Those new failure modes are more like human failure modes than traditional symbolic logic failures.

We need to get better at using them and building them by validating both the inputs and outputs of such systems in more sophisticated ways, but to act surprised and denounce them because they fail in different ways than more primitive systems misses the point.

They're stochastic by design. If we want deterministic results, we must use deterministic validators in conjunction with the stochastic system. It's trivial, and one day security experts will look back on the time when people didn't, in the same way we look back on '90s software that didn't validate user input at all.
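To make "deterministic validator" concrete, here's a rough sketch of what that could look like, assuming the model proposes each action as a JSON object with `tool` and `path` fields. The tool names, the `/workspace/` root, and the output format are all assumptions for illustration, not how any particular agent actually works.

```python
import json
import posixpath

ALLOWED_TOOLS = {"read_file", "list_dir"}  # hypothetical tool names
ALLOWED_ROOT = "/workspace/"               # hypothetical sandbox root

def validate_action(raw_output: str) -> dict:
    """Return the normalized action if it passes every check, else raise ValueError."""
    try:
        action = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc
    if not isinstance(action, dict):
        raise ValueError("output is not a JSON object")
    if action.get("tool") not in ALLOWED_TOOLS:
        raise ValueError("tool is not on the allowlist")
    path = action.get("path")
    if not isinstance(path, str):
        raise ValueError("path is missing or not a string")
    # Collapse ".." segments before checking the prefix.
    # This is purely lexical and does not account for symlinks.
    normalized = posixpath.normpath(path)
    if not (normalized + "/").startswith(ALLOWED_ROOT):
        raise ValueError("path escapes the allowed root")
    return {"tool": action["tool"], "path": normalized}
```

Whatever the model was talked into, the check gives the same answer for the same output: `{"tool": "read_file", "path": "/workspace/../etc/passwd"}` is rejected every time, as is any tool outside the allowlist. It only constrains what the agent can do, not what the model can be persuaded to attempt.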

We can separate them, but the $ value of an agent that does is much lower than one that doesn't.

As a pre-LLM analogy, imagine working at a bank with a whitelist firewall. You need to install a package, but that requires an IT ticket. Safer but slooooower.

Now, I'm not saying what the answer here is, but that is the issue.

The answer may be more like industries that get safer through lessons (like aviation) rather than going for 100% safety out of the gate. Because both fast travel and AI agents are insanely useful.

What? Aviation safety is not designed to get safer through lessons? They literally try to ensure it is 100% safe out of the gate. The accidents that happen are usually statistical outliers and lead to loss of life.

That's what it means when they say aviation regulations are written in blood. Not that they just fling planes into the sky and go "boy, I hope we learn some new regulations from this". The number of airplane crashes would be astronomically larger if the 100% safety part was not embedded into the design process.

I think we agree? Unless my reading comp is off today.