| I can talk a lot about this, since this is a space I've spent a lot of time experimenting in. All I will say is that all these detectors (a) create a ton of false positives, and (b) are incredibly easy to bypass if you know what you are doing. As an example, one method I found that works extremely well is to simply rewrite the article section by section, with instructions to mimic the writing style of an arbitrary block of human-written text. This works a lot better than (as an example) asking it to write in a specific style. Like, if I just say something along the lines of "write in a casual style that conveys lightheartedness towards the topic", that is not going to work as well as simply saying "rewrite mimicking the style in which the following text block is written: X" (where X is an example block of human-written text). There are some silly things that will (a) trigger human-written text to be detected as AI and (b) allow AI text to avoid detection, e.g. using a broad vocabulary tends to make AI detectors flag the text as written by AI. So if you are using Grammarly to "improve your writing", then don't be surprised if it gets flagged. The inverse is true too. If you use some statistical analysis to replace less common expressions with more common expressions, AI text is less likely to be detected as AI. If someone is interested, I can talk a lot more about the hundreds of experiments I've done by now. |
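To make the "replace less common expressions with more common ones" idea concrete, here is a minimal sketch. It is not the commenter's actual pipeline: the frequency table `WORD_FREQ`, the synonym groups `SYNONYM_GROUPS`, and both helper functions are hypothetical, and a real version would presumably draw frequencies from a large corpus and handle punctuation, casing, and multi-word expressions.

```python
# Hypothetical word frequencies (made-up counts, for illustration only).
WORD_FREQ = {
    "use": 900, "utilize": 40,
    "help": 800, "facilitate": 30,
    "show": 700, "demonstrate": 60,
}

# Hypothetical groups of roughly interchangeable words.
SYNONYM_GROUPS = [
    {"use", "utilize"},
    {"help", "facilitate"},
    {"show", "demonstrate"},
]


def build_replacements(groups, freq):
    """Map every word in a group to the most frequent word of that group."""
    mapping = {}
    for group in groups:
        most_common = max(group, key=lambda w: freq.get(w, 0))
        for word in group:
            if word != most_common:
                mapping[word] = most_common
    return mapping


def simplify(text, mapping):
    """Replace rarer words with their more common counterpart (whitespace tokens only)."""
    return " ".join(mapping.get(word, word) for word in text.split())


replacements = build_replacements(SYNONYM_GROUPS, WORD_FREQ)
simplified = simplify("we utilize tools to facilitate work", replacements)
print(simplified)
```

The design is deliberately naive: it only swaps whole lowercase tokens, which is enough to show the direction of the transformation (narrower, more common vocabulary).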
So I'm a researcher in vision generation and haven't read too much about LLM detection but am aware of the error rates you mention. I have questions...
What I'm absolutely surprised by is the use of perplexity for detection. Why would you target perplexity? LMs are trained to minimize NLL/entropy. Instruction-tuned models are then tuned even further in that direction, such that you're minimizing the cross-entropy relative to human output (or at least human-desired output). Which makes it obvious that it would flag generic or common patterns as AI generated. But I'm just absolutely baffled that this is the main metric being used and, in the case of this paper, the only metric. It also gives a very easy way to fool these detectors, since it suggests that just throwing in a random word or a few spelling mistakes would throw off detection, given that such changes clearly increase perplexity. To me this sounds like using a GAN's discriminator to identify outputs of GANs (the whole training method is about trying to fool the discriminator!). (Obviously I'm also not buying the zero-shot claim.)
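For anyone who wants to see the perplexity point in numbers, here is a toy sketch. It assumes an add-one smoothed unigram model over a tiny made-up corpus, which is nothing like the LLM-based scoring a real detector would use; the helper names `train_unigram` and `perplexity` are hypothetical. Perplexity is just the exponential of the mean negative log-likelihood per token, so a single token the model has never seen pushes it up.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return an add-one smoothed unigram probability function."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """exp of the mean negative log-likelihood per token."""
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)


# Tiny made-up corpus, for illustration only.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog ."
).split()
prob = train_unigram(corpus)

clean = "the cat sat on the mat".split()
noisy = "the cat szt on the mat".split()  # one injected spelling mistake

ppl_clean = perplexity(clean, prob)
ppl_noisy = perplexity(noisy, prob)
print(f"clean: {ppl_clean:.2f}  noisy: {ppl_noisy:.2f}")
```

The typo version scores a higher perplexity than the clean one, which is the direction a "low perplexity means AI" threshold would read as more human.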