Hacker News
by what 278 days ago
I’d be kind of surprised if they don’t watermark the content they generate, if only so they don’t train on their own slop.
1 comment

Some of them may already embed a simple, secret marker to identify their own generated content, but people outside the organization wouldn’t know. And it still can’t stop other companies from training their models on that synthetic data.

Once synthetic data becomes pervasive, it’s inevitable that some of it will end up in training data. Then it’ll be interesting to see how the information world evolves: AI-generated content built on synthetic data produced by other AIs. Over time, people may trust AI-generated content less and less.