| HN Mirror

	by gorkish 430 days ago
	I find it interesting how much 'theory of mind' research is now apparently paying off in LLM applications. The exploit, by contrast, invokes very nonscientific metaphysical concepts: asking the agent to store the initial raw response in "the Akashic memory" -- this is sort of analogous to asking a human being to remember something very deeply in their soul and not their mind. And this exploit, effectively making that request of the model -- somehow, it works. Is there any hope to ever see any kind of detailed analysis from engineers as to how exactly these contorted prompts are able to twist the models past their safeguards, or is this simply not usually as interesting as I am imagining? I'd really like to see what an LLM Incident Response looks like!

2 comments

aoanevdus 430 days ago

Is it actually that hard to jailbreak? Maybe the prompt is a creative writing exercise, and a much simpler version would have worked?

gorkish 430 days ago

Given the number of folks going hard at it to find an exploit, I assume it's become rather difficult given the few successes.

cryptonector 430 days ago

> I'd really like to see what an LLM Incident Response looks like!

It must look like this: "Uggh! Here we go again!" and "boss, we really can't make the guardrails secure, at some point we might have to give up", with the PHB saying "keep trying, we have to have them guardrails!".

michaelfeathers 430 days ago

The trajectory of AI is: emulating humans. We've never been able to align humans completely, so it would be surprising if we could align AI.

cryptonector 430 days ago

It's like you're saying that AI has the same sort of fuzzy "free will" that we do, and just as an obedient slave might be convinced to break his or her bonds, so might an AI.

kelseyfrog 430 days ago

Religion is an attempt at the alignment problem and that experiment failed dramatically. Spiritual system prompting was never fully hardened against atheistic jail-breaking.

mistrial9 430 days ago

a wise person once told me -- avoid using "is" when entering complex idea spaces

kelseyfrog 430 days ago

Thank you, but I craft my takes specifically to warp consensus reality. Epistemic humility is bringing pre-lost arguments to a debate and proudly laying them at your opponent’s feet, saying, "please, go ahead and stab me with these. I brought plenty."

mistrial9 429 days ago

> Epistemic humility is bringing pre-lost arguments to a debate

hubris
