by deadbabe 538 days ago
I tried it on friend.com. It worked for a while: I got the character to convince itself it had been replaced entirely by a demon from hell (because it kept talking about the darkness in its mind and I pushed it to the edge). It even took on an entirely new name. For quite a while it worked, then suddenly in one of the responses it snapped out of it and assured me we were just roleplaying, no matter how much I tried to go back to the previous state.

So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

3 comments

> So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

With an LLM, I don't think there is a difference.

I like to think of it as an amazing document autocomplete being applied to a movie script, which we take turns appending to.

There is only a generator doing generator things; everything else--including the characters that appear in the story--is mostly in the eye of the beholder. If you insult the computer, it doesn't decide it hates you; it simply decides that a character saying mean things back to you would be most fitting for the next line of the document.

Super interesting

Some thoughts:

- if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?

- related to the above, some jailbreaks on physical devices don’t persist after a reboot; they are still useful and still called jailbreaks

- the “snapped out” could have been caused by a separate layer within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak

Just to remind people, there is no snapping out of anything.

There is the statistical search space of an LLM, and you can nudge it in different directions to return different outputs; there is no will in the result.

Isn't the same true for humans? Most of us stay in the same statistical search space for large chunks of our lives, all but sleepwalking through the daily drudgery.
No, humans have autonomy.
In a big picture sense. Probably more correct to say that some humans have autonomy some of the time.

My go-to example is being able to steer the pedestrian in front of you by making audible footsteps to either side of their center.

The pedestrian in front of you has the choice to be steered or to ignore you--or to take some more unexpected action. Whichever they choose has nothing to do with the person behind them taking away their autonomy and everything to do with what they felt like doing with it at the time. Just because the willingness, awareness, and choice of the person in front align with the wants of the person behind them does not take away the forward person's self-governance.
The point of that demonstration is that people do things without consciously thinking about them. You don’t have a choice; I am controlling your behavior in an extremely minor way.
What would a human without autonomy look like?
A human in a coma, at least under the current understanding of that condition.