by zacmps
1092 days ago
But the original content is frequently recoverable. You can't just take copyrighted code, base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation. From my (admittedly vague) understanding, copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could conclude that the lineage doesn't pass through models. IANAL
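To make the base64 point concrete, here is a minimal Python sketch. The `original` bytes are a made-up stand-in for a copyrighted file, not real code from anywhere. It only shows that the encoded form looks different while the exact original comes back on decode; it says nothing about how a model stores its training data.

```python
import base64

# Hypothetical stand-in for a copyrighted source file (made up for illustration).
original = b"def secret_sauce():\n    return 42\n"

# The encoded form shares no readable text with the source...
encoded = base64.b64encode(original)

# ...but decoding recovers the original byte for byte.
decoded = base64.b64decode(encoded)

print(encoded != original)  # the transformed copy looks different
print(decoded == original)  # yet the original is fully recoverable
```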
What if we train the model on paraphrases of the copyrighted code? The model can't reproduce exactly what it has not seen.
Also consider the size ratio: 1TB of code+text ends up as 1GB of model weights. There is no space to "memorize" the training set; it can only learn basic principles and how to combine them to generate code on demand.
Copyright law should, in principle, protect only expression, not ideas. As long as the model learns the underlying principles without copying the superficial form, it should be OK. That's my 2c.