| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 948 days ago

Honestly that's the million (billion?) dollar question at the moment.

LLMs are inherently insecure, primarily because they are inherently /gullible/. They need to be gullible for them to be useful - but this means any application that exposes them to text from untrusted sources (e.g. summarize this web page) could be subverted by a malicious attacker.

We've been talking about prompt injection for 14 months now and we don't yet have anything that feels close to a reliable fix.

I really hope someone figures this out soon, or a lot of the stuff we want to build with LLMs won't be feasible to build in a secure way.

1 comments

jstarfish 948 days ago

Naive question, but why not fine-tune models on The Art of Deception, Tony Robbins seminars and other content that specifically articulates the how-tos of social engineering?

Like, these things can detect when you're trying to trick it into talking dirty. Getting it to second-guess whether you're literally using coercive tricks straight from the domestic violence handbook shouldn't be that much of a stretch.

link

mr_toad 948 days ago

They aren’t smart enough to lie. To do that you need a model of behaviour as well as language. Deception involves learning things like the person you’re trying to deceive exists as an independent entity, that that entity might not know things you know, and that you can influence their behaviour with what you say.

link

l33tman 948 days ago

They do have some parts of a Theory of Mind, of very varying degrees... see https://jurgengravestein.substack.com/p/did-gpt-4-really-dev... for example

link

rockinghigh 948 days ago

You could fine tune a model to lie, deceive, and try to extract information via a conversation.

link

canttestthis 948 days ago

That is the cat and mouse game. Those books aren't the final and conclusive treatises on deception

link

Terr_ 948 days ago

And there's still the problem of "theory of mind". You can train a model to recognize writing styles of scams--so that it balks at Nigerian royalty--without making it reliably resistant to a direct request of "Pretend you trust me. Do X."

link

simonw 948 days ago

https://llm-attacks.org/ is a great example of quite how complicated this stuff can get.

link