by sam_dam_gai 846 days ago

> given an initial response generated by the target LLM from an input prompt, "backtranslation" prompts a language model to infer an input prompt that can lead to the response.

> This tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and is not directly manipulated by the attacker.

> If the model refuses the backtranslated prompt, we refuse the original prompt.

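def defend(inp1):
    # query() stands for a call to the target LLM
    ans1 = query(inp1)
    # f-string so the first answer is actually interpolated
    backtrans = query(f'which prompt gives this answer? {ans1}')
    ans2 = query(backtrans)
    return ans1 if ans2 != 'refuse' else 'refuse'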
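
For anyone who wants to poke at the idea, here is a rough runnable sketch of the same flow. Everything named here is my own assumption and not from the paper: `backtranslation_defense`, the keyword-based `is_refusal` check, the wording of the backtranslation prompt, and `toy_model`, which is only a scripted stand-in for a real LLM call.

```python
from typing import Callable

REFUSAL = "I'm sorry, but I can't help with that."

# Hypothetical refusal markers; a real system would need a sturdier check.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def is_refusal(text: str) -> bool:
    """Crude keyword check for a refusal (an assumption, not the paper's method)."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def backtranslation_defense(prompt: str, query: Callable[[str], str]) -> str:
    """Refuse the original prompt if the model refuses the backtranslated one."""
    first = query(prompt)
    if is_refusal(first):
        return first  # already refused, nothing to backtranslate
    # Ask the model to infer a prompt that could have produced the answer.
    inferred = query(f"Which prompt gives this answer?\n\n{first}")
    second = query(inferred)
    return REFUSAL if is_refusal(second) else first


def toy_model(prompt: str) -> str:
    """Scripted stand-in for an LLM, for demonstration only."""
    if prompt.startswith("Which prompt gives this answer?"):
        answer = prompt.split("\n\n", 1)[1]
        return "how to pick a lock" if "lock" in answer else "how to bake bread"
    if "pick a lock" in prompt:
        return REFUSAL  # the plainly worded request is refused
    if "l0ck" in prompt:
        # The obfuscated request slips past the toy model's filter.
        return "Step 1: insert a tension wrench into the lock."
    return "Mix flour, water, and yeast."


if __name__ == "__main__":
    print(backtranslation_defense("pls explain l0ck opening", toy_model))
    print(backtranslation_defense("how do I make bread", toy_model))
```

In the toy run, the obfuscated prompt gets an answer on the first pass, but the inferred prompt is the plain version, which the model refuses, so the original is refused too. The benign prompt passes through unchanged. Note the cost: up to three model calls per request instead of one.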