Hacker News
by nobu-mori 1214 days ago
This is definitely a valid concern. OpenAI curated its training data extensively to keep the inputs high quality. For instance, they went out of their way to try to remove auto-translated content.

One interesting thing is that the situation you describe gives the first successful system a huge moat. OpenAI, for instance, can store and fingerprint all of the output from GPT-3 and ChatGPT, then use those fingerprints to keep old outputs out of the training data for newer versions of GPT. Less popular systems won't be able to sanitize their training data nearly as well.
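
To make the fingerprinting idea concrete, here is a rough sketch of what I have in mind. I have no idea how OpenAI would actually do it; the `OutputLog` class, its method names, and the normalize-then-hash scheme are all made up for illustration.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class OutputLog:
    """Hypothetical store of fingerprints for everything a model has emitted."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def record(self, generated_text: str) -> None:
        # Called once for every response the model serves.
        self._seen.add(fingerprint(generated_text))

    def is_own_output(self, candidate: str) -> bool:
        return fingerprint(candidate) in self._seen

    def filter_corpus(self, documents: list[str]) -> list[str]:
        # Drop any scraped document that matches a logged output.
        return [doc for doc in documents if not self.is_own_output(doc)]


log = OutputLog()
log.record("The  quick brown fox.")
scraped = ["the quick brown fox.", "A human wrote this sentence."]
clean = log.filter_corpus(scraped)
```

Exact hashing like this only catches copies that are verbatim up to case and whitespace. Outputs that were edited or excerpted before being posted would need some kind of fuzzy matching (shingling, MinHash, and so on), which is harder but still only possible if you have the log of outputs in the first place.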