by mikehearn 1804 days ago
ML novice question: is this atypical when training models? Wasn't GPT-3 trained on a lot of copyrighted data? My gut instinct, which is based on very little information, is that it would be pretty hard to train models if you could only use open-licensed material.
4 comments

It would be pretty concerning if people used GPT-3 while they were writing a novel, and it assisted them in plagiarizing a Stephen King novel.

We already have examples of Copilot blatantly plagiarizing code.

Right, but it sounds like the bigger issue here is that the model might spit out copyrighted material, not just that it scrapes it. The former seems like a technology problem that Microsoft can solve.
The issue is not only that the model might spit out copyrighted material verbatim (which it does) but also that it might spit out non-obvious derivative works that will get you in legal hot water years down the road.
It is pretty concerning that copyright exists
Yes, it would stifle NLP research immensely, and we likely wouldn't see anything better than GPT-3 for years if such restrictions were put in place.
You're basically seeing how some people would have had open source play out. You can look at and use the code but not to make money or in any other way that I personally disapprove of. This is a world where open source would have ended up being pretty much irrelevant.
Aren't we also seeing now why people would want to do that? A multi-billion-dollar company is using people's work to make more profit without paying them.

I definitely understand why people pick a license that disallows uses they don't agree with. Imagine baking cookies for your friends, and one of them reselling them. The material effect on you is the same, since you gave away your cookies either way, but sometimes you make or do something for a certain group of people and not for others to make a profit off your work.

People can do whatever they want with their work, including not sharing it at all.

But a great deal of the value that's come from open source generally has been that open source licenses haven't imposed the sort of usage-based restrictions (e.g. free for educational use only) that were fairly common in the PC world.

And, to your example, in the case of software the incremental copy that your friend sold cost you absolutely nothing. So it comes down to a purely emotional response to someone else making money off something you made.

>So it comes down to a purely emotional response to someone else making money off something you made.

Exactly, as I said, the material situation is the same. But we are all emotional beings: you would do certain things for your family that you wouldn't do for strangers. I don't think this case is any different.

I personally don't work for free for a company, but I do charity work for free. Working for a company in the time I work for a charity would "cost me absolutely nothing" if I already spend the time anyway, but everyone understands the difference.

There is a difference between a model that achieves "fair use" of copyrighted work and one that regurgitates copyrighted work without attribution.
You’re free to privately research with this data but commercializing other people’s work using ML is theft.

Edit: commercialization of the derived work is one factor US law explicitly considers in making a fair use determination. That said, even if it weren’t commercialized it may still be infringement, and I believe it is.

Commercializing isn't really the issue. It's still copyright infringement even if you release it for free (i.e. piracy), because it's unauthorized redistribution (i.e. copying).
Even if we accept that (which many wouldn't, as most licenses say little about research), the research would never be very useful if you can never make a comparable dataset to use in the real world.
I get that the problem is commercializing, but the theories around copyright that are being deployed here would prevent even free, open-source NLP research from becoming a reality.
I am not a lawyer, but I do believe GPT-3, as a commercial product trained using copyrighted data, constitutes infringement. I also think GPT-2 does not, because it is for research purposes, which makes it fair use.
Yes, training data is very valuable. Producing quality training data is an industry in itself. GitHub is trying to get it for free, and it doesn’t work that way.