| HN Mirror



	by fasterik 1123 days ago
	Was GPT-4 trained on data that was acquired illegally? Or was it trained on data acquired legally that OpenAI didn't have the rights to redistribute? There is a difference. In the latter case, whether it counts as "stealing" would come down to whether or not GPT-4 counts as a derivative work, or some similar legal concept.


svaha1728 1123 days ago

https://www.washingtonpost.com/technology/interactive/2023/a...

Scribd has lots of PDFs of copyrighted books. The Washington Post article mentions several other places from which PDFs of copyrighted textbooks, etc., were downloaded and scraped.


fasterik 1123 days ago

That's interesting to know, but that doesn't by itself imply that it's illegal. For example, Google Books, which has massive amounts of scanned PDFs of copyrighted works, is considered fair use under US copyright law.


cyanydeez 1122 days ago

There's no good-faith world where OpenAI trained only on legally available works.

The only valid argument is whether their model or its output is itself protected legally.


still_grokking 1123 days ago

As long as you don't try to scrape all the book's content…

It's only fair use for search purposes.


fasterik 1122 days ago

It's fair use if the work is "transformative". GPT-4 isn't publishing the content of the books; it's publishing a model derived from the entire corpus. I'm not a lawyer, but I think there's an argument that it is transformative.


still_grokking 1121 days ago

IMHO, as transformative as encoding a DVD as DivX…

It's correct that OpenAI isn't publishing any of the "stolen" content directly. But they "stole" it to make their service possible in the first place. Not distributing it themselves doesn't make much of a difference, then.
