Not necessarily. This also uncovers the weakness of the NYT lawsuit.
Imagine your corpus of training data contains the following:
- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"
- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year'"
- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"
Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"
"According to the New York Times, it rains cats and dogs twice per year."
NYT content was never in the training data. However, it -is- quoted and attributed a lot across commoncrawl-approved sources, so the NYT attribution gets a higher probability as the next token.
Zoom that out to full articles quoted throughout the web, and you get false positives.
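To make the "higher probability association" point concrete, here is a toy sketch. It is a plain bigram counter over the three snippets above, not how a real LLM is trained, and the names (`corpus`, `tokenize`, `next_token_probs`) are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# The three third-party snippets from the example; none comes from nytimes.com.
corpus = [
    "I read in the NYT that 'it rains cats and dogs twice per year'",
    "according to the NYT, 'cats and dogs level rain occurs 2 times per year'",
    "cats and dogs rain events happen twice per year, according to the New York Times",
]

def tokenize(text):
    # Lowercase and keep only alphanumeric runs, dropping punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

# Count how often each token follows each preceding token.
bigrams = defaultdict(Counter)
for doc in corpus:
    tokens = tokenize(doc)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev][nxt] += 1

def next_token_probs(prev):
    # Relative frequency of each token observed right after `prev`.
    counts = bigrams[prev]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

print(next_token_probs("the"))
```

In this toy corpus every occurrence of "the" is followed by either "nyt" or "new" (as in "New York Times"), so the attribution dominates even though no NYT page was ever ingested.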
They were getting huge chunks of NYT articles out, verbatim. I remember being stunned. Then I remember finding out there was some sort of trick to it that made it seem sillier.
Does it matter? What's the legal view on "I downloaded some data which turns out to be copied from a copyrighted source and it was probably trivial to figure it out, then trained the LLM on it"? I mean, they work on data processing - of course they would expect that if someone responds with 10 paragraphs in reporting style, under a link to NYT... that's just the article.
I genuinely don’t know the answer but I can see it being more complicated than “OpenAI purposefully acquired and trained on NYT articles”.
If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but it actually contains quite a bit of copyrighted content, whose responsibility is it to check that data for copyright violations? If I use something licensed as CC in good faith and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?
> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
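For anyone who hasn't seen the trick the quote describes, a minimal sketch of the exclusive-or mixing follows. This is not Monolith's actual code; `xor_bytes` and the sample byte strings are hypothetical stand-ins for the files.

```python
import os

def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Byte-wise exclusive-or of two equal-length inputs.
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Stand-in for the copyright-encumbered file.
copyrighted = b"text someone claims copyright on"
# Stand-in for the public file it gets mixed with.
public = os.urandom(len(copyrighted))

# The "mixed-up garbage" that supposedly carries no copyright.
mixed = xor_bytes(copyrighted, public)
# Anyone holding the public file can undo the mixing.
recovered = xor_bytes(mixed, public)

assert recovered == copyrighted
```

The math round-trips perfectly, which is exactly the quote's point: the bits alone say nothing about where they came from.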
It is permissible to use limited portions of a work, including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.
But yes, open to interpretation as far as where LLM training falls.