by genrilz 594 days ago
It's possible that this might break the method, but what seems most likely to me is that the LLM will simply replace every 5th word with some other word that it is more likely to use because of the watermark sampling. The resulting output would then display roughly the same level of "watermarkedness".

You might be able to have one LLM output the original and then have another do a partial rewording, though. The resulting text would likely have higher than chance "watermarkedness" for both LLMs, but less than you would expect from a plain output. Perhaps this would be sufficient for short enough outputs?

What happens when we are all reading LLM output all the time? Do we simply start to adapt to LLM writing styles and word choice, and, possibly without realizing it, watermark our own original writing?
You might be right, but my first instinct is that this probably wouldn't happen enough to throw off the watermarking too badly.

The word the watermark favors depends on the previous four, and the scheme only works if there is enough entropy present that any one of multiple words would work. Thus it's not a simple matter of humans picking up particular word choices. There might be some cases where there are 3 tokens in a row that occur with low entropy after the first token, and then one token generated with high entropy at the end. That would cause a particular 5-word phrase to occur. Otherwise, the word choice would appear pretty random. I don't think humans pick up on stuff like that even subconsciously, but I could be wrong.
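To make the "based off the previous four" point concrete, here is a toy sketch of a context-keyed choice. It is not the real scheme. The key `SECRET`, the scoring function `g`, and the helpers `pick`, `score`, and `generate` are all made up for illustration, and it assumes every candidate at a step is equally plausible.

```python
import hashlib
import random

SECRET = b"demo-key"  # hypothetical watermark key, invented for this sketch
CONTEXT = 4           # number of previous tokens that seed the choice

def g(context, candidate, secret=SECRET):
    """Pseudo-random score in [0, 1) for `candidate`, keyed on the last four tokens."""
    ctx = "\x1f".join(context[-CONTEXT:])
    data = secret + b"|" + ctx.encode() + b"|" + candidate.encode()
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

def pick(context, candidates, secret=SECRET):
    """Pick among equally plausible next tokens, favoring the highest g-score.

    With a single candidate (low entropy) there is nothing for the watermark to bias.
    """
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates, key=lambda c: g(context, c, secret))

def score(tokens, secret=SECRET):
    """Mean g-score over every token that has a full four-token context."""
    vals = [g(tokens[i - CONTEXT:i], tokens[i], secret)
            for i in range(CONTEXT, len(tokens))]
    return sum(vals) / len(vals)

def generate(n, watermark, seed=0):
    """Toy 'model': each step offers three random candidates from a tiny vocabulary."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(50)]
    tokens = ["a", "b", "c", "d"]
    for _ in range(n):
        candidates = rng.sample(vocab, 3)
        tokens.append(pick(tokens, candidates) if watermark else rng.choice(candidates))
    return tokens
```

In this toy, `score(generate(300, False))` should hover around 0.5 and `score(generate(300, True))` should sit noticeably higher. Since the favored word changes whenever any of the previous four tokens changes, there is no fixed "watermark vocabulary" for a reader to absorb.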

I would be interested to see if LLMs pick up the watermarks when fed watermarked training data, though. Evidently ChatGPT can decode base64 [0], so it seems like these things can pick up on some pretty subtle patterns.

[0] https://www.reddit.com/r/ChatGPT/comments/1645n6i/i_noticed_...