There are, in theory, a couple of ways to prevent copyright violations in output. For closed models that aren't distributed as weights, companies could index perceptual hashes of all the training data at a granular level (like individual paragraphs of text), then check every output against that index and retry generation, so that no duplicate or near-duplicate of copyrighted training data ever gets served as a response to end users.
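
A minimal sketch of that check/retry loop, to make the idea concrete. Everything here is an assumption for illustration: word shingles with Jaccard similarity stand in for the "perceptual hash", and `ParagraphIndex`, `serve`, the 5-word shingle size and the 0.5 threshold are made-up names and values. A real index over a full training set would need something like MinHash/LSH rather than this linear scan.

```python
import re
from typing import Callable, Iterable, Optional, Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """Return the set of k-word shingles of a case/punctuation-normalized text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two shingle sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class ParagraphIndex:
    """Hypothetical index of protected paragraphs (linear scan, toy scale only)."""

    def __init__(self, paragraphs: Iterable[str], k: int = 5) -> None:
        self.k = k
        self.entries = [shingles(p, k) for p in paragraphs]

    def is_near_duplicate(self, text: str, threshold: float = 0.5) -> bool:
        candidate = shingles(text, self.k)
        return any(jaccard(candidate, e) >= threshold for e in self.entries)


def serve(generate: Callable[[], str], index: ParagraphIndex,
          max_retries: int = 3) -> Optional[str]:
    """Call generate() until no paragraph of the response matches the index.

    Returns None if every attempt contained a near-duplicate.
    """
    for _ in range(max_retries):
        response = generate()
        paragraphs = [p for p in response.split("\n\n") if p.strip()]
        if not any(index.is_near_duplicate(p) for p in paragraphs):
            return response
    return None
```

The threshold is the real design decision: too low and ordinary phrasing gets blocked, too high and lightly edited copies slip through.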

Another way would be to train an internal model directly on published works, use that model to generate a corpus of sanitized, rewritten and reformatted data about the works still under copyright, then use only the sanitized corpus to train a final model. For example, the sanitized corpus might describe the Harry Potter books in minute detail but not contain a single sentence taken from the originals. Because the final model never sees the original text, models trained that way shouldn't be able to reproduce excerpts from Harry Potter books even if they were distributed as open weights.
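
The one step of that pipeline that is easy to sketch is the gate between the internal model and the final training set: dropping any rewritten document that still contains a sentence from the originals. This is only a rough illustration under stated assumptions. The function names, the regex sentence splitter and the `min_words` cutoff are all hypothetical, and it catches exact sentences after case/punctuation normalization, not close paraphrases.

```python
import re
from typing import Iterable, List, Set


def normalize(sentence: str) -> str:
    """Lowercase and strip punctuation so trivial edits don't hide a copy."""
    return " ".join(re.findall(r"[a-z0-9']+", sentence.lower()))


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_sentence_set(originals: Iterable[str], min_words: int = 4) -> Set[str]:
    """Collect normalized sentences from the original works.

    Very short sentences are skipped (assumed cutoff) because they would
    match ordinary text by coincidence.
    """
    seen: Set[str] = set()
    for doc in originals:
        for sentence in split_sentences(doc):
            norm = normalize(sentence)
            if len(norm.split()) >= min_words:
                seen.add(norm)
    return seen


def filter_sanitized(rewrites: Iterable[str],
                     original_sentences: Set[str]) -> List[str]:
    """Keep only rewritten documents with no sentence found in the originals."""
    kept = []
    for doc in rewrites:
        if not any(normalize(s) in original_sentences
                   for s in split_sentences(doc)):
            kept.append(doc)
    return kept
```

Only the documents that pass this filter would go into the corpus for the final model.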