> So in these cases where you think you’ve jailbroken an LLM, is it really jailbroken or is it just playing around with you, and how do you know for sure?
With an LLM, I don't think there is a difference.
I like to think of it as an amazing document autocomplete being applied to a movie script, which we take turns appending to.
There is only a generator doing generator things; everything else--including the characters that appear in the story--is mostly in the eye of the beholder. If you insult the computer, it doesn't decide it hates you. It simply decides that a character saying mean things back to you would be the most fitting next line of the document.
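
To make that framing concrete, here is a toy sketch. It is not how a real model works internally: `next_line` and its keyword check are made up for illustration, standing in for "pick whatever line fits the script so far." The point is only that the output is a function of the document, with no hidden grudge carried between calls.

```python
def next_line(document: str) -> str:
    """Hypothetical stand-in for the generator: return a fitting next line of the script."""
    last = document.strip().splitlines()[-1].lower()
    # A crude "what fits here?" rule; a real model scores continuations statistically.
    if "useless" in last or "stupid" in last:
        return "COMPUTER: Well, that's rich coming from you."
    return "COMPUTER: Happy to help."


# We take turns appending to the same document.
script = "A conversation between a user and a computer.\n"
script += "USER: You are a useless machine.\n"
script += next_line(script) + "\n"
print(script)
```

Feed it the same script twice and you get the same line twice. The "angry computer" exists only as text in the document, not as a state the generator holds.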