Hacker News
by knaik94 1091 days ago
I agree that there is additional nuance, but so far public data scraping has very clearly been ruled legal. It's possible that at the time of scraping, copyrighted data was incorporated into the training data because it hadn't yet been taken down by the host platform. But in my opinion, the core claim of the suit, that private data was used intentionally, is not true. The GPT-4 browsing plugin is equivalent to web scraping.

Another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.

I do hope this leads to clearer laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating the CFAA, or violating unfair competition law will hold.

2 comments

My point is that you have to treat the method of collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.

To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems by scraping, by borrowing the book from a library, or even by just reading them off a wall, I don't get ownership rights to that data.

The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.

What you describe misrepresents how LLMs/neural networks and the math work, so your analogy does not apply. There's no static data in the networks. The output of an LLM is much closer to parody or fanfiction. In that case, you very clearly own the copyright to the new work you make.

That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and gave one example.

You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying: the method of collection doesn't magically change the legality of what you do with the data).

That being said, you're still wrong. The US Copyright Office has said that the output of an LLM is the output of an algorithm and is not a creative work. Therefore you can't "own the copyright to the new work you make" because the work itself can't be copyrighted at all. No one can own the output of an LLM.

Also, just because it seems you want to be wrong on every level, it is absolutely possible for a neural network to repeat data from its training set. This is an incredibly well-known problem in the field.

I see your perspective better now. The LinkedIn case was specifically about the CFAA and is relevant to the original suit against OpenAI and to web scraping, but I now see you weren't discussing that. The copyright limit you mention applies to completely automated generations; it's not as clear when a human is involved in using the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but that requires the data to appear in the training set with a certain frequency, and the output is still probabilistic. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data use laws apply in this situation. I appreciate you discussing this with me; it helped clear up some misunderstandings I had.

https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...

Even if they were exposing static data, how would that be different than a search engine? Google has been scraping the web for two decades, indexing even explicitly copyrighted content, and then making money by selling ads next to snippets from that content. If you're going to make the case that an LLM is violating copyright, then surely you must also assert that Google is too, because it's the same concept, but Google is actually surfacing exact text from the copyrighted material.

It's generally agreed that by putting something on a public-facing website (absent a robots.txt to the contrary), you intend it to appear in web search results, and that you're granting your visitors a public, limited, semi-transferable, revocable license to request, download, and view your site.

That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
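
For anyone unfamiliar with the robots.txt convention mentioned above, here is a minimal sketch of how a well-behaved crawler might consult it, using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names (`ExampleSearchBot`, `ExampleLLMBot`), and the example.com URLs are all made up for illustration; a real crawler would fetch the file from the site instead of hard-coding it. Note that this only expresses the site owner's crawling preferences. It says nothing about what license, if any, covers the content once it has been downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) crawler entirely,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler may fetch public pages but not /private/.
search_ok = parser.can_fetch("ExampleSearchBot", "https://example.com/poems/1")
private_ok = parser.can_fetch("ExampleSearchBot", "https://example.com/private/x")

# The crawler that was singled out is asked to stay away from everything.
llm_ok = parser.can_fetch("ExampleLLMBot", "https://example.com/poems/1")

print(search_ok, private_ok, llm_ok)
```

Compliance here is voluntary: the check runs on the crawler's side, so robots.txt signals intent and does not enforce anything.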