by tedivm 1091 days ago
My point is that you have to treat the method of collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.

To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems by scraping, by borrowing the book from a library, or even just by reading them off of a wall, I don't get ownership rights to that data.

The same logic applies to a lot of other laws around data. If you collect data on individuals, a bunch of laws come into play, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.


What you describe misrepresents how LLMs/neural networks and the math work, so your analogy does not apply. There's no static data in the networks. The output of LLMs is much closer to parody and fanfiction. In that case, you very clearly own the copyright to the new work you make.

That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and I gave one example.

You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying: the method of collection doesn't magically change the legality of the use).

That being said, you're still wrong. The US Copyright Office has said that the output of an LLM is the output of an algorithm and is not a creative work. Therefore you can't "own the copyright to the new work you make" because the work itself can't be copyrighted at all. No one can own the output of an LLM.

Also, just because it seems you want to be wrong on every level, it is absolutely possible for a neural network to repeat data from its training set. This is an incredibly well-known problem in the field.

I see your perspective better now. The LinkedIn case was specifically about the CFAA and is relevant to the original suit against OpenAI and web scraping, but I now see you weren't discussing that. The copyright limit you mention applies to completely automated generations; it's not as clear when a human uses the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but that requires the data to appear with a certain frequency, and it still relies on a probabilistic output. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data use laws apply in this situation. I appreciate you discussing this with me; it did help clear up some misunderstandings I had.

https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...