
> It's lossy compression, the same way a JPEG might be

Compression, yes, but this is commingling as well. The entire corpus is compressed together, which identifies common patterns, and in the model those patterns now essentially overlap.

The original document is represented statistically in the final model, but you’ve lost the ability to reproduce it closely. What you gain instead is the ability to generate something statistically similar to a large number of original documents that are related or structurally similar.
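
A toy sketch of what I mean by overlapping, assuming a word-level bigram counter stands in for the model. That is a huge simplification (real models learn far richer statistics), and `train_bigram` and `is_generable` are names I made up for illustration:

```python
from collections import Counter, defaultdict

def train_bigram(docs):
    """Pool word-pair counts from every document into one shared table."""
    table = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table

def is_generable(table, sentence):
    """True if every adjacent word pair in the sentence was seen in training."""
    words = sentence.split()
    return all(table.get(a, Counter())[b] > 0 for a, b in zip(words, words[1:]))

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
model = train_bigram(docs)

# Counts from both documents land in the same entry for "the";
# nothing records which document contributed which follower.
followers_of_the = dict(model["the"])

# A sentence that appears in neither document is still consistent
# with the pooled statistics.
blend = "the cat sat on the rug"
blend_is_generable = is_generable(model, blend)
blend_is_original = blend in docs
```

In this toy, `blend_is_generable` is true while `blend_is_original` is false: the table can produce something that looks like both sources without being either of them, and it keeps no record of where one document ended and the other began.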

I’m just commenting, not disputing any argument about fair use.