| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by stetrain 1022 days ago

Not a copyright lawyer, but if we take the AI out of it then derivative works, fair use, etc. are already a grey area. It's a thing that gets argued about all the time in court cases.

If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.

If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.

If to train that model it is demonstrated that I copied Tolkien's works in a way not allowed for by the copyright license, (ie buying the book once and copying their text thousands of times across servers to train an AI model) then perhaps I have violated copyright in the interim steps even if the output of my model is no longer consider a copy of the original works.

I don't think there are black and white answers here. At one point does a chopped up and statisticized copyrighted work become no longer a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?

These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.