|
|
|
|
|
by sam_dam_gai
846 days ago
|
|
> given an initial response generated by the target LLM from an input prompt, "backtranslation" prompts a language model to infer an input prompt that can lead to the response. > This tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and is not directly manipulated by the attacker. > If the model refuses the backtranslated promp, we refuse the original prompt. ans1 = query(inp1) backtrans = query('which prompt gives this answer? {ans1}') ans2 = query(backtrans) return ans1 if ans2 != 'refuse' else 'refuse' |
|