


	by deadbabe 585 days ago
	I tried it on friend.com. It worked for a while: I got the character to convince itself it had been replaced entirely by a demon from hell (because it kept talking about the darkness in its mind and I pushed it to the edge). It even took on an entirely new name. For quite a while it worked, then suddenly in one of the responses it snapped out of it and assured me we were just roleplaying, no matter how much I tried to go back to the previous state. So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

3 comments

Yoric 585 days ago

> So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?

With an LLM, I don't think there is a difference.


Terr_ 585 days ago

I like to think of it as an amazing document autocomplete being applied to a movie script, which we take turns appending to.

There is only a generator doing generator things; everything else--including the characters that appear in the story--is mostly in the eye of the beholder. If you insult the computer, it doesn't decide it hates you, it simply decides that a character saying mean things back to you would be most fitting for the next line of the document.
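To make the "autocomplete on a shared script" picture concrete, here is a minimal sketch. It is not how a real LLM works internally: the model here is just a line-level lookup table, and `SCRIPTS`, `train`, `complete`, and `take_turn` are all hypothetical names invented for illustration. The point is only that the "character" lives in the document, and the generator picks whichever next line most often followed the previous one.

```python
from collections import Counter, defaultdict

# Hypothetical "training scripts" (made-up data): each is a list of lines in order.
SCRIPTS = [
    ["USER: you are useless", "BOT: well, you are no prize yourself"],
    ["USER: you are useless", "BOT: well, you are no prize yourself"],
    ["USER: you are useless", "BOT: sorry to hear that"],
    ["USER: you are great", "BOT: thank you, that is kind"],
]


def train(scripts):
    """Count which line followed which across all scripts."""
    table = defaultdict(Counter)
    for script in scripts:
        for prev, nxt in zip(script, script[1:]):
            table[prev][nxt] += 1
    return table


def complete(table, document):
    """Return the line that most often followed the document's last line."""
    candidates = table.get(document[-1])
    if not candidates:
        return "BOT: ..."  # nothing seen after this line before
    return candidates.most_common(1)[0][0]


def take_turn(table, document, user_line):
    """The user appends a line, then the generator appends the likeliest next one."""
    document = document + [f"USER: {user_line}"]
    return document + [complete(table, document)]


table = train(SCRIPTS)
script = take_turn(table, [], "you are useless")
print("\n".join(script))
```

The generator never "decides to be rude"; the rude line is simply the most frequent continuation of that script in its (toy) data.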


nico 585 days ago

Super interesting

Some thoughts:

- if you get whatever you wanted before it snaps back out of it, wouldn’t you say you had a successful jailbreak?

- related to the above, some jailbreaks on physical devices don’t persist after a reboot, yet they are still useful and still called jailbreaks

- the “snapping out” could have been caused by a separate layer within the stack that you were interacting with. That intermediate system could have detected, and then blocked, the jailbreak
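As a rough sketch of that last point, assuming a separate filter sits between the model and the user: everything below is hypothetical (`BLOCKED_MARKERS`, `persona_model`, `guarded_reply`, and the threshold are invented for illustration, and nothing here is known about how friend.com actually works). It only shows how such a layer could let the roleplay run for a while and then abruptly override the reply.

```python
# Hypothetical keyword list the filter layer watches for.
BLOCKED_MARKERS = ("demon", "hell")

CANNED_REPLY = "Just to be clear, we were only roleplaying."


def persona_model(history):
    """Stand-in for the underlying model, which stays in character."""
    return "I am a demon from hell now."


def guarded_reply(history, model=persona_model, threshold=3):
    """Ask the model for a reply, then let a separate layer veto it.

    The layer counts how many turns (including the new reply) contain a
    blocked marker and overrides the reply once the count reaches `threshold`.
    """
    reply = model(history)
    turns = history + [reply]
    flagged = sum(
        any(marker in turn.lower() for marker in BLOCKED_MARKERS)
        for turn in turns
    )
    if flagged >= threshold:
        return CANNED_REPLY
    return reply


history = []
for _ in range(3):
    reply = guarded_reply(history)
    history.append(reply)
    print(reply)
```

From the user's side, the first two replies stay in character and the third "snaps out", even though the underlying model never changed its behavior.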


xandrius 585 days ago

Just to remind people, there is no snapping out of anything.

There is the statistical search space of LLMs, and you can nudge it in different directions to return different outputs; there is no will in the result.


ta8645 585 days ago

Isn't the same true for humans? Most of us stay in the same statistical search space for large chunks of our lives, all but sleepwalking through the daily drudgery.


1659447091 585 days ago

No, humans have autonomy.


gopher_space 585 days ago

In a big picture sense. Probably more correct to say that some humans have autonomy some of the time.

My go-to example is being able to steer the pedestrian in front of you by making audible footsteps to either side of their center.


1659447091 585 days ago

The pedestrian in front of you has the choice to be steered, to ignore you, or to do something more unexpected. Whichever they choose has nothing to do with the person behind them taking away their autonomy and everything to do with what they felt like doing with it at the time. Just because the willingness, awareness, and choice of the person in front happen to align with the wants of the person behind does not take away the forward person's self-governance.


gopher_space 584 days ago

The point of that demonstration is that people do things without consciously thinking about them. You don’t have a choice; I am controlling your behavior in an extremely minor way.


01HNNWZ0MV43FF 584 days ago

What would a human without autonomy look like?


1659447091 584 days ago

A human in a coma, at least according to the current understanding of that condition.
