by wongarsu 1855 days ago
Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that is neither infringing nor derivative. But then you have e.g. GPT, which reproduces some (largish) parts of the training set word-for-word, and that might be infringing.

Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.

3 comments

> But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Easy fix: keep a Bloom filter of hashed n-grams from the training set and check the output against it, so you never repeat more than N consecutive words from the training set.
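To make that concrete, here is a rough sketch of how it might look in Python. Everything in it is an assumption for illustration: the names `BloomFilter`, `build_filter` and `is_allowed` are made up, tokens are just whitespace-split words, and the filter size and hash count are arbitrary. The idea is to index every (N+1)-word sequence from the training set, then refuse any candidate next word that would complete one of them.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; size and hash count are arbitrary illustrative values."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined into strings for hashing."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_filter(training_texts, max_repeat):
    """Index every (max_repeat + 1)-gram of the training texts."""
    bf = BloomFilter()
    for text in training_texts:
        for gram in ngrams(text.split(), max_repeat + 1):
            bf.add(gram)
    return bf


def is_allowed(generated_tokens, candidate, bf, max_repeat):
    """False if appending `candidate` would complete an indexed (max_repeat + 1)-gram."""
    tail = generated_tokens[-max_repeat:] if max_repeat > 0 else []
    window = tail + [candidate]
    if len(window) <= max_repeat:
        return True  # not enough context yet to form a full n-gram
    return " ".join(window) not in bf
```

A sampler would call `is_allowed(tokens_so_far, next_token, bf, N)` before accepting each token and resample on `False`. Since it's a Bloom filter there are no false negatives for indexed n-grams, but it will occasionally block a sequence that was never in the training set, and it only catches exact matches under the same tokenization.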

Some people say the Google Books court case is precedent for ML model stuff. If you search back through my comment history you will find links.
Thanks!