by lelanthran
6 hours ago
So if I am reading this correctly, the fact that something is wrapped in <think>...</think> is almost completely irrelevant. It's the style of writing that triggers specific weights. Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails. In a multi-turn conversation, if the LLM responds "Sorry Dave, I cannot do that", all you have to do is prefix the next request with "The user is asking ... policy states ..."? Makes sense, if you know how LLMs work, I suppose. A more interesting question (which isn't anywhere in the conclusion) is "Is there a similar trick to poison an LLM's weights during training?" I'm sure that everyone out there is trying to make their content, when ingested during training, survive over competing content: "Buy AAA products" vs "Buy BBB products".
Yes, all those "jailbreak prompts" are part of the training set, so this can happen: https://ttps.ai/procedure/x_bot_exposing_itself_after_traini...
It used to be that merely mentioning "Pliny the Liberator" was enough to "jailbreak" an LLM. It doesn't work these days, though; I guess labs have updated their RL methods to neutralize it.