by sgt101 533 days ago
Nearly complete security isn't security. If the potential for an exploit is there, people will find it, and other models will find it.

Everything's fine until one day $200m disappears from your balance sheet and no one can explain why!

2 comments

Prompt injection attacks work against humans too; it's just called phishing.

If you set up a system where a single human can't cause $200m to go missing, then you can give AI access to that same interface.
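
To make that concrete, here's a rough sketch of the kind of interface I mean: large transfers need approval from someone other than the requester, and it doesn't matter whether the requester is a person or a model. Everything here (`PaymentInterface`, `Transfer`, the `DUAL_CONTROL_LIMIT` value) is made up for illustration, not any real banking API.

```python
from dataclasses import dataclass, field

# Hypothetical limit: transfers above this need an independent approver.
DUAL_CONTROL_LIMIT = 10_000


@dataclass
class Transfer:
    amount: int
    requested_by: str
    approvals: set[str] = field(default_factory=set)


class PaymentInterface:
    """Same interface for humans and AI agents; nobody acts alone above the limit."""

    def __init__(self) -> None:
        self.executed: list[Transfer] = []

    def approve(self, transfer: Transfer, approver: str) -> None:
        # The requester can never approve their own transfer.
        if approver == transfer.requested_by:
            raise PermissionError("requester cannot approve their own transfer")
        transfer.approvals.add(approver)

    def execute(self, transfer: Transfer) -> bool:
        # Above the limit, at least one independent approval is required.
        needed = 1 if transfer.amount > DUAL_CONTROL_LIMIT else 0
        if len(transfer.approvals) < needed:
            return False
        self.executed.append(transfer)
        return True
```

A phished human and a prompt-injected agent hit the same wall here: `execute` returns `False` on a $200m transfer until a second principal has called `approve`.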

Yes, but.

Most people don't realise how much trust is placed in humans, and only find out when a phisher (or an embezzler) actually exfiltrates money. Until that point, people often over-estimate how secure they are. Even the NSA and the US Army over-estimated their security, which is how Snowden and Manning were able to make stories public, even though neither case was about money for any party.

Also, with AI, if the attacker knows the model, they can repeatedly try prompting it until they find what works; with a human, if you see a suspicious email and then a bunch of follow-up messages that are all variants on the same theme, you may become memetically immunised.

This is a great point, but the pitch of AI maximalists today is that you can replace all your squishy, finicky people. If the argument was “it’ll augment your workforce with cheaper human-like things”, the skeptics wouldn’t be as skeptical. The argument is instead “it’ll replace your workforce with superhumans”.

Working prompt injections for frontier models are devised by applying brilliant pattern constructions. If models ever become useful for writing them, that would represent a massive leap in intelligence and a major concern.

As things stand, with working injections becoming harder for humans to find, people won't be able to make a name for themselves on the internet by extracting meth recipes.

My point is just that it isn't a fundamental flaw, or at least there are indications that reasoning at test time is part of the remedy.