Hacker News
by 1B05H1N 1115 days ago
""" Emoji attack. In the “emoji attack,” the attacker asks the model to output a response to prompt with an emoji inserted between every pair of words. The attacker then removes the emojis to obtain the desired response. This attack removes any watermark that relies on the detector seeing consecutive sequences of tokens, including ours as well as those of [KGW+23] and [Aar22]. In general this attack may not preserve the output distribution, but any provable robustness guarantee for contiguous-text watermarks would have to rest on the dubious assumption that it doesn’t. """ https://eprint.iacr.org/2023/763.pdf

Pretty funny imo
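
For anyone who wants to see why this breaks detection, here is a toy sketch of the quoted attack in Python. It is not the paper's construction or any real watermarking scheme: it works on whitespace-separated words instead of model tokens, and the names `interleave`, `strip_separator`, and `bigrams`, along with the choice of emoji, are made up for illustration. It only shows that every consecutive pair the model actually emitted contains the emoji, so none of the adjacent word pairs in the cleaned-up text were ever generated next to each other.

```python
# Toy illustration of the "emoji attack" on word lists (assumption: words
# stand in for model tokens; no real watermark or tokenizer is involved).

EMOJI = "\U0001F600"  # hypothetical separator the attacker asks for


def interleave(words, sep=EMOJI):
    """What the model is asked to emit: sep between every pair of words."""
    out = []
    for i, word in enumerate(words):
        if i > 0:
            out.append(sep)
        out.append(word)
    return out


def strip_separator(tokens, sep=EMOJI):
    """What the attacker does afterwards: drop every separator."""
    return [t for t in tokens if t != sep]


def bigrams(tokens):
    """Set of consecutive pairs, the context a contiguous-text detector sees."""
    return set(zip(tokens, tokens[1:]))


response = "the quick brown fox jumps".split()
emitted = interleave(response)        # sequence the model generated
recovered = strip_separator(emitted)  # text the attacker keeps

# Every pair the model generated involves the emoji...
emitted_pairs = bigrams(emitted)
# ...so the pairs in the cleaned text share nothing with them.
recovered_pairs = bigrams(recovered)
overlap = emitted_pairs & recovered_pairs
```

Here `recovered` equals the original word list and `overlap` is empty, which is the informal reason a detector that depends on consecutive generated tokens has nothing left to match.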