> But the original content is frequently recoverable.

What if we train the model on paraphrases of the copyrighted code? The model can't reproduce exactly what it has not seen.

Also consider the size ratio - 1TB of code+text ends up as 1GB of model weights. There is no space to "memorize" the training set; the model can only learn basic principles and how to combine them to generate code on demand.
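To put a number on that ratio, here is a rough back-of-the-envelope sketch. It assumes the 1TB and 1GB figures are round decimal units; the function names are made up for illustration.

```python
# Figures from the comment, taken as round decimal units (an assumption).
TRAINING_BYTES = 10**12  # 1 TB of code + text
MODEL_BYTES = 10**9      # 1 GB of model weights


def compression_ratio(training_bytes: int, model_bytes: int) -> float:
    """Bytes of training data per byte of model weights."""
    return training_bytes / model_bytes


def weight_bits_per_training_byte(training_bytes: int, model_bytes: int) -> float:
    """Bits of weight storage available, on average, per byte of training data."""
    return model_bytes * 8 / training_bytes


ratio = compression_ratio(TRAINING_BYTES, MODEL_BYTES)
bits = weight_bits_per_training_byte(TRAINING_BYTES, MODEL_BYTES)
print(f"{ratio:.0f}:1, about {bits:.3f} bits of weights per training byte")
# prints "1000:1, about 0.008 bits of weights per training byte"
```

Note this is only an average over the whole corpus, so it speaks to the training set as a whole rather than to any single snippet.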

Copyright law should, in principle, only protect expression, not ideas. As long as the model learns the underlying principles without copying the superficial form, it should be OK. That's my 2c.