Then we should encourage labeled ChatGPT content like ShareGPT, which can be easily avoided in future datasets because it is clearly labeled as AI-generated content.

It's the stuff that isn't labeled as generated with ChatGPT et al. that will enter future training sets. I personally believe that's taking the "lossy JPEG" analogy too far, but I'm not an AI researcher.