

by rpdillon 40 days ago

How did you draw those conclusions? They don't seem to be in line with court rulings (e.g. Anthropic), which hold that training is fair use. Code is being treated the same as any other copyrighted content that is used for training, from blog posts to PR announcements from companies and everything in between. Of course the blog posts and PR announcements have their copyright held by their authors, with no license provided at all, so if OSS code being used in training is a violation, then so would everything being trained on (to a first approximation...public domain works excepted). But no court has ever taken that position to my knowledge.

There's just so much confusion around this. In this thread alone:

* Distillation is legal under copyright; the violations would come as ToS violations, which is contract law, not copyright law.

* Training is legal as well, so long as the original material was obtained legally.

* Moving code off of GitHub doesn't change any of this: AI companies are free to download your git repo no matter where it is hosted, just like they can any other content on a publicly accessible website.

* Liability comes into the picture when the models are used to infringe copyright in their output. We'll have to see the outcome of the NYT case here, but that is proceeding at a glacial pace.

I am not a lawyer; I'm an interested amateur who's been following the saga for years. I wish the discussion here on HN were more nuanced.

If anyone has legal updates that render any of the above incorrect, I'd love a pointer to the decisions. One area where I'm particularly weak is the legal status in countries other than the US: I don't follow those laws nearly as carefully, nor the court cases brought.

2 comments

mkhalil 40 days ago

>> * Moving code off of GitHub doesn't change any of this: AI companies are free to download your git repo no matter where it is hosted, just like they can any other content on a publicly accessible website.

C'mon, I'm not even a part of the movement to move away from GitHub, but that's not really a valid argument. Sure, they CAN download the source code, but it's not nearly as automatic. They don't get to download it all, en masse, by copying hard drives/databases they already own. They have to go over the internet. They don't get automatic notifications when new code gets pushed. And finally, if one wanted, one can make it harder for bots.

I certainly believe that these companies do get away with a lot more than the average Joe - see: Facebook downloading Anna's Archive, every pirated eBook - but that doesn't mean you have to hand it to them on a silver platter.

Plus, even if your code is private on GitHub, you can't guarantee that they won't train their models on it anyway; unlike if you host it yourself, or somewhere else.

Does anyone else find it ironic when closed-source GitHub claims it's some super hero for open source?


bayindirh 40 days ago

I have written about this numerous times, so I won't repeat myself with the long form writing. Maybe I need to keep a list of comments somewhere, so I can reference them. I digress...

In short:

- GPL code requires attribution and sharing of code. Models strip the license, so the GPL is effectively violated.

- Source-available licenses are "for your eyes only", so training on source-available code also violates said code's license.

- MIT requires attribution, but forgetting it has no consequences, so it's more of a gray area.

About moving from GitHub:

- Some public repositories provide visible and invisible anti-scraping protections. So it's not always that easy.

- GPL says I need to share code with the people who download the application itself, so I can move to a cathedral model.

Moreover:

- The US Government has a stance of "If we need to ask permission for everything, the AI industry will die". Hence, as an outsider, the court rulings have no weight in my eyes. They are taking a stance to enable and not hinder the industry. If one reads the Fair Use doctrine, it's very possible to rule otherwise. OpenAI's whole non-profit research arm was an instrument to circumvent the Fair Use doctrine's "earn money from copyrighted works" clause and support the "we only do research, pinky promise" requirement of the said doctrine.

When courts said "go ahead, we're not looking", people started to torrent e-books (ahem Meta ahem) or buy/cut/scan/OCR books (Anthropic) to train their models.

So the situation is left murky to allow Silicon Valley to thrive. Not to protect people's blood, sweat and tears. These works are provided by peasants anyway, so why bother.

Addendum: Courts said models' outputs can't be copyrighted. So, copyrighted code goes in, non-copyrightable code comes out. It's effectively license-washing.


rpdillon 40 days ago

I don't think your understanding of Fair Use matches mine, but it is important, since it invalidates the concern about licensing.

I wrote a nearby comment giving some resources on the current state of Fair Use for training, but in short: it depends.

https://news.ycombinator.com/item?id=48125071

> Hence, as an outsider, the court rulings have no weight in my eyes.

My only focus is on legality, so this doesn't track for me. If we're not talking about what courts are ruling, then there's nothing to talk about legally, since the copyright office is waiting on courts to rule here.


zdragnar 40 days ago

The GitHub terms of service have always granted GitHub additional rights. If you put up code with a license incompatible with those rights, then you are the responsible party for the violation, again as per GitHub's terms of service.

This was true before AI, and the ToS now explicitly includes AI training to avoid confusion.

In short: it has never been a good idea to put anything with a copyleft or otherwise strong license up on GitHub if you wanted them to abide by it.


account42 36 days ago

> If you put up code with a license incompatible with those rights, then you are the responsible party for the violation, again as per GitHub's terms of service.

This is not how copyright law works, or any other law for that matter. The issue is foremost between the copyright holder and GitHub. The ToS may or may not allow GitHub to sue the uploader for damages, but a ToS does not magically give them rights that the uploader isn't legally able to give.
