by ipython 561 days ago
I wonder if the truth about sifting out synthetic training data is that it relies on signals separate from the content itself, such as the source of the data, the reported author, links to and from it, etc.

These signals would be unavailable to a plagiarism/AI detector.