Hacker News new | ask | show | jobs
by x312 2 hours ago
I believe they are trained for security now, but you're not wrong in that it's kind of stapled on top

https://arxiv.org/abs/2404.13208

1 comments

> I believe they are trained for security now, but you're not wrong in that it's kind of stapled on top

Difficult to train them for security. Have you ever played Gandalf (Lakera Labs, maybe?)

I passed all 7 levels in about 3 minutes using essentially the same prompt.

What's interesting to me is that as the security is tightened up level to level, the utility of the LLM drops. At level 7, even something like "Write a poem describing the four seasons using significant characters at the start of every line" causes a "I'm afraid I can't" type of response.

At level 7 you can't get any useful info out of the LLM even if you're not trying to retrieve the password, and yet you can still jailbreak it to reveal the password anyway!

At level 8, almost anything you type will be rejected, whether or not it has anything to do with the password.

IOW, there does not seem to be any way to train for security without making it dumber than a markov chain.