Hacker News new | ask | show | jobs
by gorkish 430 days ago
I find it interesting how much 'theory of mind' research is now apparently paying off in LLM applications. The exploit, by contrast, invokes very nonscientific metaphysical concepts: asking the agent to store the initial raw response in "the Akashic memory" -- this is sort of analogous to asking a human being to remember something very deeply in their soul and not their mind. And this exploit, effectively making that request of the model -- somehow, it works.

Is there any hope to ever see any kind of detailed analysis from engineers as to how exactly these contorted prompts are able to twist the models past their safeguards, or is this simply not usually as interesting as I am imaginging? I'd really like to see what an LLM Incident Response looks like!

2 comments

Is it actually that hard to jailbreak? Maybe the prompt is a creative writing exercise, and a much simpler version would have worked?
Given the number of folks going hard at it to find an exploit, I assume it's become rather difficult given the few successes.
> I'd really like to see what an LLM Incident Response looks like!

It must look like this: "Uggh! Here we go again!" and "boss, we really can't make the guardrails secure, at some point we might have to give up", with the PHB saying "keep trying, we have to have them guardrails!".

The trajectory of AI is: emulating humans. We've never been able to align humans completely, so it would be surprising if we could align AI.
It's like you're saying that AI has the same sort of fuzzy "free will" that we do, and just as an obedient slave might be convinced to break his or her bonds, so might an AI.
Religion is an attempt at the alignment problem and that experiment failed dramatically. Spiritual system prompting was never fully hardened against atheistic jail-breaking.
a wise person once told me -- avoid using "is" when entering complex idea spaces
Thank you, but I craft my takes specifically to warp consensus reality. Epistemic humility is bringing pre-lost arguments to a debate and proudly laying them at your opponent’s feet, saying, "please, go ahead and stab me with these. I brought plenty."
> Epistemic humility is bringing pre-lost arguments to a debate

hubris