by 1B05H1N
1115 days ago
"""
Emoji attack. In the “emoji attack,” the attacker asks the model to output a response to prompt
with an emoji inserted between every pair of words. The attacker then removes the emojis to
obtain the desired response. This attack removes any watermark that relies on the detector seeing
consecutive sequences of tokens, including ours as well as those of [KGW+23] and [Aar22]. In
general this attack may not preserve the output distribution, but any provable robustness guarantee
for contiguous-text watermarks would have to rest on the dubious assumption that it doesn’t.
"""
https://eprint.iacr.org/2023/763.pdf Pretty funny imo
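
For anyone who wants to see the mechanics, here is a rough toy sketch of the idea in the quote, not code from the paper. It assumes whitespace-separated words stand in for tokens, and the names `interleave`, `strip_emoji`, and `adjacent_pairs` are made up for illustration. `interleave` fakes what the model would be asked to emit; the attacker only needs `strip_emoji`.

```python
EMOJI = "🙂"  # arbitrary separator chosen for this sketch


def interleave(text, sep=EMOJI):
    """Stand-in for the model's output: sep between every pair of words."""
    return f" {sep} ".join(text.split())


def strip_emoji(text, sep=EMOJI):
    """Attacker's post-processing: drop the separators, rejoin the words."""
    return " ".join(w for w in text.split() if w != sep)


def adjacent_pairs(text):
    """Set of consecutive word pairs, a crude proxy for contiguous token runs."""
    words = text.split()
    return set(zip(words, words[1:]))


plain = "the quick brown fox jumps"
emitted = interleave(plain)       # what the model actually generated
recovered = strip_emoji(emitted)  # what the attacker keeps

# The attacker gets the intended text back...
print(recovered == plain)
# ...but no pair of neighbouring words in it was ever generated consecutively.
print(adjacent_pairs(recovered).isdisjoint(adjacent_pairs(emitted)))
```

Both checks print `True`: every adjacent pair in the recovered text was separated by an emoji when it was generated, which is why a detector that looks at consecutive runs of tokens has nothing to latch onto.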