It covers private watermarking (you can't detect it exists without a key), resistance to modifications, etc. Essentially you wouldn't know it was there and you can't make simple modifications to fool it.
OpenAI could already be doing this, and they could be watermarking with your account ID if they wanted to.
I suppose hosted solutions like ChatGPT could offer an API where you copy some text in, and it searches its history of generated content to see if anything matches.
> bUt aCtuAlLy...
It's not like I don't know the bajillion limitations here. There are many audiences for detection. All of them are XY Problems. And the people asking for this stuff don't participate on Hacker News aka Unpopular Opinions Technology Edition.
There will probably be a lot of "services" that "just" "tell you" if "it" is "written by an AI."
One way is trying to sneak in a specific structure/pattern that is difficult for a human to notice when reading, like using a particular sentence length, paragraph length, or punctuation pattern. Or use certain words in the text that may not be frequently used by humans etc.
Watermarking needs to be subtle enough to be unnoticeable to opposing parties, yet distinctive enough to be detectable.
So this is an arms race, especially because detecting the watermark and altering the text based on it is also fun :)
The paper I linked in the parent comment describes that as the "Simple proof of concept" on page 2 and, like you said, outlines its limitations: it hurts performance and is easy to detect and work out.
Their improved method instead only replaces tokens when there are many good choices available, and skips replacing tokens when there are few good choices. "The quick brown fox jumps over the lazy dog" - "The quick brown" is not replaceable because it would severely harm the quality.
Essentially it's only replacing tokens where it won't harm the performance.
It's worth noting that any watermarking will likely harm the quality to some degree - but it can be minimized to the point of being viable.
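A rough sketch of that "only touch tokens where there are many good choices" idea, using the entropy of the next-token distribution as the gate. This is an illustration of the comment above, not the paper's exact method; `entropy_bits`, `watermarked_probs`, the `green` mask, `boost` and `min_bits` are all made-up names and values.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def watermarked_probs(probs, green, boost=2.0, min_bits=1.0):
    """Favor 'green' tokens only when the model has many good choices.

    probs    -- next-token probabilities from the model
    green    -- parallel list of booleans, True for favored tokens (hypothetical)
    boost    -- multiplier applied to green tokens (assumed value)
    min_bits -- entropy threshold below which nothing is changed (assumed value)
    """
    if entropy_bits(probs) < min_bits:
        # Few good choices (e.g. finishing "The quick brown ..."): leave as-is.
        return list(probs)
    weighted = [p * boost if g else p for p, g in zip(probs, green)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

With a peaked distribution like `[0.97, 0.01, 0.01, 0.01]` the entropy is well under one bit, so the output is unchanged; with a flat one like `[0.25, 0.25, 0.25, 0.25]` the green tokens get nudged up.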
You can do this by injecting non-visible Unicode (LTR / RTL markers, zero-width separators, the various "space" analogs, homoglyphs of "normal" characters), but it can obviously be stripped out.
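A minimal sketch of the invisible-character approach, and of how trivially it is stripped. The encoding here (ZERO WIDTH SPACE for 0, ZERO WIDTH NON-JOINER for 1, a 16-bit tag placed after the first word) is an arbitrary choice for illustration, and `embed`, `extract` and `strip_marks` are hypothetical helpers, not anything a real service is known to use.

```python
from typing import Optional

ZERO = "\u200b"  # ZERO WIDTH SPACE, used here to encode bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, used here to encode bit 1

def embed(text: str, tag: int, bits: int = 16) -> str:
    """Hide `tag` as a run of zero-width characters after the first word."""
    payload = "".join(ONE if (tag >> i) & 1 else ZERO
                      for i in reversed(range(bits)))
    head, sep, tail = text.partition(" ")
    return head + payload + sep + tail

def extract(text: str) -> Optional[int]:
    """Recover the hidden tag, or None if no marker characters are left."""
    marks = [c for c in text if c in (ZERO, ONE)]
    if not marks:
        return None
    return int("".join("1" if c == ONE else "0" for c in marks), 2)

def strip_marks(text: str) -> str:
    """The counter-move: drop the marker characters."""
    return text.replace(ZERO, "").replace(ONE, "")
```

`extract(embed("The quick brown fox", 1234))` gives back the tag, and `strip_marks` returns the original text with nothing left to extract, which is the whole weakness.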
Make half of the tokens (the AI's "dictionary") slightly more likely.
This would not impact output quality much, but it would only work for longish outputs. And the token probability "key" could probably be reverse engineered with enough output.
It would be pretty easy to figure out by comparing against standard word probabilities in average datasets. Even then, the longer this system runs, the more likely it is to pollute its own dataset by people learning to write from GPT itself.
Can you elaborate on "invisible?" The only invisible character I can imagine is a space. It seems like any other character either isn't invisible or doesn't exist (i.e., isn't a character).
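Here is a toy version of the "make half the tokens slightly more likely" idea, mostly to show why it needs longish outputs. Everything in it is an assumption for illustration: the vocabulary is fake, the "model" picks tokens uniformly at random instead of being a real language model, and the keyed hash split in `is_green` is just one way to choose the favored half.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(200)]  # stand-in vocabulary, not a real tokenizer

def is_green(key: str, prev: str, tok: str) -> bool:
    """Keyed pseudo-random split: about half of all tokens are 'green' after `prev`."""
    digest = hashlib.sha256(f"{key}|{prev}|{tok}".encode()).digest()
    return digest[0] % 2 == 0

def generate(key: str, n: int, boost: float, rng: random.Random) -> list:
    """Toy generator: uniform over VOCAB, with green tokens weighted by `boost`."""
    out = ["<s>"]
    for _ in range(n):
        weights = [boost if is_green(key, out[-1], t) else 1.0 for t in VOCAB]
        out.append(rng.choices(VOCAB, weights=weights, k=1)[0])
    return out[1:]

def z_score(key: str, tokens: list) -> float:
    """How far the green count sits above the 50% expected by chance."""
    n = len(tokens)
    if n == 0:
        return 0.0
    prev, green = "<s>", 0
    for tok in tokens:
        green += is_green(key, prev, tok)
        prev = tok
    return (green - 0.5 * n) / math.sqrt(0.25 * n)
```

Text generated with `boost=2.0` scores far above zero when checked with the right key, while unboosted text, or the same text checked with the wrong key, hovers around zero. The score grows roughly with the square root of the length, so a smaller boost needs a longer text before the signal clears the noise.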
Additionally, if I copy-paste text like this are the invisible characters preserved? Are there a bunch of extra spaces somewhere?
When students try to evade plagiarism detectors, they will swap characters like replacing spaces with nonbreaking spaces, replacing letters with lookalikes (I vs extended Cyrillic Ӏ etc), and inserting things like the invisible 'Combining Grapheme Joiner'.
IMHO it isn't a feasible way of watermarking text though - as someone would promptly come up with a website that undid such substitutions.
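For a sense of how little such an "undo" site would need, here is a sketch. The `LOOKALIKES` table only covers the examples mentioned above and is nowhere near complete; `undo_substitutions` is a made-up name. Dropping everything in Unicode category Cf (format characters) is a blunt choice that would also remove legitimate joiners in some scripts.

```python
import unicodedata

# Tiny illustrative table; a real cleaner would need a much larger mapping.
LOOKALIKES = {
    "\u00a0": " ",  # NO-BREAK SPACE -> ordinary space
    "\u04c0": "I",  # CYRILLIC LETTER PALOCHKA -> Latin capital I
}

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER (a nonspacing mark, not category Cf)

def undo_substitutions(text: str) -> str:
    """Drop invisible format characters and map known lookalikes back."""
    out = []
    for ch in text:
        if ch == CGJ or unicodedata.category(ch) == "Cf":
            continue  # zero-width spaces, LTR/RTL marks, joiners, etc.
        out.append(LOOKALIKES.get(ch, ch))
    return "".join(out)
```

Feeding it `"\u04c0t\u00a0is\u034f a\u200b test"` gives back plain `"It is a test"`.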
> IMHO it isn't a feasible way of watermarking text though - as someone would promptly come up with a website that undid such substitutions.
It doesn't matter since there's no one-pass solution to counterfeiting.
You have the right of it-- the best you can hope for is adding more complexity to the product, which adds steps to their workflow and increases the chances of the counterfeiter overlooking any particular detail that you know to look for.
OpenAI can internally keep a "hash" or a "signature" of every output it ever generated.
Given a piece of text, they should then be able to trace it back to the specific session (or set of sessions) in which it was generated.
Depending on the hit rate and the hashing methods used, they may be able to indicate the likelihood of a piece of text being generated by AI.
Why would they want to is my question. A single character change would break it.
Then you have database costs of storing all that data forever.
More so, it's only for OpenAI. I don't think it will be too long before other GPT-4 level models are around and won't give two shits about catering to the AI identification police.
That depends on how they hash the data, right? They can use various types of Perceptual Hashing [1] techniques which wouldn't be susceptible to a single-character change.
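To make the difference concrete, here is a sketch using SimHash over word shingles. SimHash is a locality-sensitive fingerprint swapped in here as one example of the general idea; nothing above says which technique would actually be used. The shingle size `k` and the 64-bit width are arbitrary choices.

```python
import hashlib

def shingles(text: str, k: int = 3) -> list:
    """Overlapping k-word windows; short texts yield a single window."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))]

def simhash(text: str, bits: int = 64) -> int:
    """Each shingle votes +1/-1 on every bit; the sign of the tally is the bit."""
    counts = [0] * bits
    for sh in shingles(text):
        h = int.from_bytes(hashlib.sha256(sh.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

An exact hash like SHA-256 changes completely after a one-character edit. With this kind of fingerprint, the edit only changes a few shingle votes, so the fingerprint of the edited text usually stays within a few bits of the original, while unrelated text tends to differ in about half the bits. A lookup would then search for stored fingerprints within some Hamming distance instead of an exact match.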
> Then you have database costs of storing all that data forever.
A database of all textual content generated by people? That sounds like a gold mine, not a liability. But as I've mentioned earlier, they don't need to keep the raw data (a perceptual hash is enough).
> won't give two shits about catering to the AI identification police
I'm sure there will be customers willing to pay for access to these checks, even if they're only limited to OpenAI's product (universities and schools - for plagiarism detection, government agencies, intelligence agencies, police, etc).
Here's a decent paper on it.
It covers private watermarking (you can't detect it exists without a key), resistance to modifications, etc. Essentially you wouldn't know it was there and you can't make simple modifications to fool it.
OpenAI could already be doing this, and they could be watermarking with your account ID if they wanted to.
The current best countermeasure is likely paraphrasing attacks https://arxiv.org/pdf/2303.11156.pdf