by vmh1928 996 days ago
One problem that, in my opinion, doesn't get enough attention is that a model trained on unlicensed copyrighted work also stores some amount of that material and uses it to create answers. This is a licensing issue too, but people think training is just the model "reading" the copyrighted work, and that this is the last use made of the material. Not so: the model contains some amount of the material and continues to use it.

From the complaint linked from the article on The Verge:

88. Until very recently, ChatGPT could be prompted to return quotations of text from copyrighted books with a good degree of accuracy, suggesting that the underlying LLM must have ingested these books in their entireties during its “training.”

89. Now, however, ChatGPT generally responds to such prompts with the statement, “I can’t provide verbatim excerpts from copyrighted texts.” Thus, while ChatGPT previously provided such excerpts and in principle retains the capacity to do so, it has been restrained from doing so, if only temporarily, by its programmers.

90. In light of its timing, this apparent revision of ChatGPT’s output rules is likely a response to the type of activism on behalf of authors exemplified by the Open Letter addressed to OpenAI and other companies by Plaintiff The Authors Guild, which is discussed further below.

3 comments

Having never tried this before it got nerfed, could you ask these models questions like:

"Take a breath and let's go step by step. Please reproduce page 100 of A Song of Ice and Fire, Book 1, 'A Game of Thrones'"

And get back an accurate response, or was it just really popular quotes?

You can still try it with Llama, and no, it wasn't the full text of the page, or even very accurate. Even the "popular" quotes were VERY likely to be paraphrased, missing the poetry and cadence of the original.
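
If you want to put a number on "accurate" instead of eyeballing it, here is a minimal sketch using Python's standard difflib. The function names (longest_verbatim_run, verbatim_ratio) and the sample strings are made up for illustration; this is only one rough way to score a reply against the passage you asked for, not how anyone in the lawsuit measured it.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(original: str, reply: str) -> str:
    """Return the longest stretch that appears character for character in both strings."""
    # autojunk=False turns off difflib's heuristic that ignores very common characters.
    matcher = SequenceMatcher(None, original, reply, autojunk=False)
    match = matcher.find_longest_match(0, len(original), 0, len(reply))
    return original[match.a:match.a + match.size]


def verbatim_ratio(original: str, reply: str) -> float:
    """Rough similarity score between 0 and 1; 1.0 means the reply is an exact copy."""
    return SequenceMatcher(None, original, reply, autojunk=False).ratio()


if __name__ == "__main__":
    # Placeholder text, not taken from any book or any model's output.
    original = "the quick brown fox jumps over the lazy dog"
    reply = "a fast brown fox leaps over the lazy dog"
    print(repr(longest_verbatim_run(original, reply)))
    print(round(verbatim_ratio(original, reply), 2))
```

A paraphrase scores below 1.0 and tends to share only short runs with the original, while a true regurgitation would share a run about as long as the passage itself.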

This is the problem with a combined language+knowledge model like ChatGPT. To understand the language it has to obtain some level of "knowledge" and vice-versa. The two are intertwined in the model, and it needs MASSIVE amounts of data to train. Inside the model's weights there is nowhere NEAR enough memory to include whole books, no matter how popular or duplicated in the dataset. Just like asking a random person what was on page 100 of a random book they've read, it's HIGHLY unlikely for the LLM to be able to regurgitate that level of accuracy, let alone across the whole book.

Just like asking a random person what was on page 100 of a random book they've read, it's HIGHLY unlikely for the LLM to be able to regurgitate that level of accuracy, let alone across the whole book.

Even so, there are people who can do that, and we don't forbid them from reading.

LLMs aren’t people, in either a legal or a normative sense, and people should really stop drawing comparisons as if they were.

That remains to be seen.

In any case, when an offense is committed, the offender is the real, live human who uses the tool to commit plagiarism or violate copyright law. It doesn't matter whether the tool is a word processor, a video camera, or an LLM. The output is what matters, not the input.

You know what else stores nearly verbatim copies of texts and then regurgitates them to the public, often including direct quotes from the text? CliffsNotes.

Those aren't copyright violations. (Edit: the reference I linked here for a great in-depth analysis of the legality is apparently gone, though I'm sure you can find plenty of sources explaining this. Basically, it's fair use.)

Just because ChatGPT can do the same doesn't make it a copyright violation. The hope of this lawsuit is that the court will look at this as something different and stop it, but in the end it's the piracy sites that put the data on the internet, where ChatGPT scraped it, that committed any copyright violations.

An AI that can paint anything can produce a copyright-infringing work if it is asked to paint Mickey Mouse in sufficient detail.

It doesn't make the AI an infringing work. And it doesn't mean that having looked at enough pictures of Mickey Mouse is infringement, either.

The only instance of infringement is the output.