by FloorEgg
529 days ago
I really don't think it's that simple. I can read books and then earn money by applying what I learned from them. I can also study art and then make original art in the same or similar styles. If a person were doing this, no one would be claiming copyright infringement. The only difference is that it's a machine doing it and not a person. The nature of copyright and plagiarism boils down to paraphrasing, and so long as LLMs sufficiently paraphrase the content, it's an open question whether it's copyright infringement, one that requires new law/precedent. So the fact that they are earning money is a red herring unless they are reproducing the exact same content without paraphrasing (with an exception for commentary, e.g. they can quote part of a work while commenting on it). Where they have gotten into trouble, e.g. with NYT, afaik is when the LLM reproduced a whole article word for word. I think they have all tried hard to prevent the LLM from ever doing that, to avoid that legal risk.
How many books can you read, understand, and memorize in time T, and how many books can an AI ingest in the same time T?
If we're down to paraphrasing, watch this video [1], and think again.
Many models, if you ask the right questions, reproduce their training set with great accuracy, and this is only prevented with monkey patching, IIUC.
So, it's still a big mess, even if we don't add a copyrighted corpus to the mix. Oh, BTW, datasets like "The Stack" are not as clean as they claim. I have seen at least two non-permissively licensed code repositories inside that dataset.
[1]: https://youtu.be/LrkAORPiaEA