Hacker News
by air7 1220 days ago
I read an interesting paper proposing a way to watermark LLM output text so that detection becomes very accurate once the text is long enough. It works by subtly shifting the probabilities of the next word to be generated, based on the previous word the model output. Circumventing it by manually changing words post hoc would potentially take almost as much work as writing the text from scratch.
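Here is a toy sketch of that idea as I understand it. It is not the paper's code. It assumes integer token ids, uses random logits as a stand-in for a real model, and all names and constants (`VOCAB_SIZE`, `GAMMA`, `DELTA`, `SECRET_KEY`, `green_list`, `generate`, `z_score`) are hypothetical. The previous token seeds a pseudo-random split of the vocabulary into a "green" half. Generation adds a bonus to green logits, and detection counts green tokens and computes how far that count sits above chance.

```python
import math
import random

VOCAB_SIZE = 1000   # hypothetical toy vocabulary of integer token ids
GAMMA = 0.5         # assumed fraction of the vocabulary on the "green list"
DELTA = 4.0         # assumed logit bonus added to green tokens
SECRET_KEY = 12345  # hypothetical key shared by generator and detector


def green_list(prev_token: int) -> set[int]:
    """Pseudo-randomly pick the green tokens, seeded by the previous token."""
    rng = random.Random(SECRET_KEY * 1_000_003 + prev_token)
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def generate(length: int, seed: int = 0, watermark: bool = True) -> list[int]:
    """Sample tokens from a stand-in 'model' whose logits are random noise."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(length):
        logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
        if watermark:
            # Nudge the distribution toward tokens that are green given the last token.
            for t in green_list(tokens[-1]):
                logits[t] += DELTA
        # Softmax sampling (subtract the max for numerical stability).
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=weights)[0])
    return tokens


def z_score(tokens: list[int]) -> float:
    """One-proportion z-test: green hits versus the GAMMA rate expected by chance."""
    t = len(tokens) - 1
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    print("watermarked:", z_score(generate(200, seed=1, watermark=True)))
    print("plain:      ", z_score(generate(200, seed=1, watermark=False)))
```

Watermarked output should give a large z-score, while unwatermarked output should land near zero. Because the z-score grows roughly with the square root of the text length, this is why detection only becomes reliable for long enough text, and also why swapping a few words lowers the score without necessarily erasing it.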

The idea seems quite robust to me, and I can envision a future where companies that provide access to LLMs also publish a detection tool for their models.

1 comment

It's not hard to make a model that rewrites the text without changing its meaning, which defeats this. Our model[0], which is based on feeding ChatGPT random things from the interwebs from before 2020 and letting it wobble on about them, is pretty good and nice to play with, but it's pretty easy to change the score radically by changing just a few words. This is whack-a-mole no matter how it's done. For now you can be sure people are too lazy to do it, but in the future there will be many tools for evading detectors like this.

[0] https://filteroutai.com/validate/a07e081b71b294ba2de236441be... https://filteroutai.com/validate/2c3fa6de32845df02be7a4ff185...