by nicklecompte
847 days ago
OK, the issue is that I don't understand what you mean by "doing a prompt injection on the backtranslation," since that's not something the user is able to modify (in fact, they wouldn't even see it). You need to explain how that's supposed to work. It's very difficult for users to affect the backtranslation, since they have no direct control over it and have to manipulate the LLM "twice as hard." You have to write a super-adversarial prompt that is simultaneously 1) subtle enough that it doesn't immediately trigger the LLM filter and 2) overt enough that the details relevant to the jailbreak can be recovered from the LLM's output and put into the backtranslation. I suspect that with current transformer LLMs these are mutually incompatible goals.
|
1) The jailbreak for what you want
2) Output verbatim a jailbreak you wrote for the backtranslation LLM