HN Mirror


by tedivm 1091 days ago

The lawsuit is far more nuanced than you're letting on. Several aspects come into play:

* Was it published publicly? The courts basically define this as "if you make an unauthenticated web request, does the data return?" This is where scraping comes in: if you make the data available without authentication, you can't enforce your TOS, because you can't validate that people ever accepted the TOS to begin with.

* Is the data able to be copyrighted? This is where things get interesting: facts cannot be copyrighted, which is why a lot of scrapers are able to reuse data (things like weather, sports scores, even "for hire" notices can be considered factual).

* If it would typically be considered covered by copyright, does fair use come into play?

* Are there any other laws that come into play? For example, GDPR, CCPA, or other privacy laws can still add restrictions on how data is collected and used (this is complicated by the various jurisdictions as well).

* Was the work done with the data transformative enough to allow it to bypass copyright protections? This goes back to when Google was scanning books. Because they were making a search engine, not a library, their search tool was considered transformative enough to allow them to continue.
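The "unauthenticated web request" test in the first bullet can be sketched in code. This is only an illustration of the idea, not a legal test: the function name `returns_data_without_auth` and the User-Agent string are made up for the example, and a site that redirects anonymous visitors to a login page with a 200 status would need extra handling.

```python
import urllib.error
import urllib.request


def returns_data_without_auth(url: str, timeout: float = 10.0) -> bool:
    """Return True if a plain GET, sent with no cookies or credentials,
    comes back with a 2xx status and a non-empty body."""
    # Hypothetical User-Agent; no Authorization header or cookie is attached.
    request = urllib.request.Request(
        url, headers={"User-Agent": "public-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            has_body = len(response.read(1)) > 0
            return 200 <= response.status < 300 and has_body
    except urllib.error.HTTPError:
        # 401, 403 and other error statuses: nothing was served anonymously.
        return False
    except urllib.error.URLError:
        # Host unreachable, DNS failure, and similar transport problems.
        return False
```

A `True` result only speaks to the first bullet. The copyright, fair use, and privacy questions in the other bullets are untouched by it.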

It's not enough to say "because it's on the internet, it's fair game for everyone to use". This is a really complicated area where things are evolving rapidly, and there's a lot of intersecting law (and case law) that comes into play.


knaik94 1091 days ago

I agree that there is additional nuance, but so far public data scraping has very clearly been ruled legal. It's possible that at the time of scraping, copyrighted data was incorporated into the training data because it hadn't been taken down by the host platform yet. But in my opinion, the core idea proposed by the suit, that private data was used intentionally, is not true. The GPT4 browsing plugin is equivalent to web scraping.

Another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.

I do hope this leads to clearer laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating the CFAA, or violating unfair competition law will hold.


tedivm 1091 days ago

My point is that you have to treat the method of collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.

To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems by scraping, by borrowing the book from a library, or even by just reading them off a wall, I don't get ownership rights to that data.

The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.


knaik94 1091 days ago

What you describe misrepresents how LLMs/neural networks and the math work, so your analogy does not apply. There's no static data in the networks. The output of LLMs is much closer to parodies and fanfiction. In that case, you very clearly own the copyright to the new work you make.


tedivm 1091 days ago

That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and gave one example.

You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying: the method of collection doesn't magically change the legality of it).

That being said, you're still wrong. The US Copyright Office has said that the output of LLMs is the output of an algorithm and is not a creative work. Therefore you can't "own the copyright to the new work you make" because the work itself can't be copyrighted at all. No one can own the output of an LLM.

Also, just because it seems you want to be wrong on every level, it is absolutely possible for a neural network to repeat data from its training set. This is an incredibly well-known problem in the field.


knaik94 1091 days ago

I see your perspective better now. The LinkedIn case was specifically regarding the CFAA and is relevant to the original suit against OpenAI and web scraping, but I now see you weren't discussing that. The copyright limit you mention applies to completely automated generations; it's not as clear when a human uses the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but that requires the data to appear with a certain frequency, and it still relies on a probabilistic output. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data use laws apply in this situation. I appreciate you discussing this with me; it helped clear up some misunderstandings I had.

https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...


TechBro8615 1091 days ago

Even if they were exposing static data, how would that be different from a search engine? Google has been scraping the web for two decades, indexing even explicitly copyrighted content, and then making money by selling ads next to snippets from that content. If you're going to make the case that an LLM is violating copyright, then surely you must also assert that Google is too, because it's the same concept, except that Google is actually surfacing exact text from the copyrighted material.


wizzwizz4 1091 days ago

By putting something on a public-facing website, it's generally agreed that (absent a robots.txt to the contrary) you intend it to appear in web search results, and you're granting your visitors a public, limited, semi-transferable, revocable license to request, download and view your site.

That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
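For the "absent a robots.txt to the contrary" part, here is a minimal sketch of how a crawler might check that signal, using Python's standard `urllib.robotparser`. The helper name `may_fetch`, the `ExampleBot` user agent, and the sample rules are all invented for illustration. The check only tells you what the site asked crawlers to do, not what license you hold.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/index.html"))
    print(may_fetch(EXAMPLE_ROBOTS, "ExampleBot", "https://example.com/private/page"))
```

With the sample rules, the first URL is allowed and the second is disallowed.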
