Hacker News
by ipsum2 1307 days ago
If you put your code on GitHub, it's bound by the TOS, which states (https://docs.github.com/en/site-policy/github-terms/github-t...):

> We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

Doesn't that contradict the purpose of this website? Is this performance art?

Also, why is this website so secretive? Why not publish the license on the website?

> Who is PayToTrain created by and why?

> PayToTrain is created by a small group of developers and attorneys who are passionate about open source software and ensuring that developers are properly compensated for their work. The website and service are provided completely free of charge.

Edit: PayToTrain looks like a non-disclosed ad and/or project from legalist . com.

4 comments

I agree the Humans Only Clause does not prevent Microsoft from training Copilot on code hosted on GitHub, because of GitHub's terms of service, but I think it does prevent, say, Salesforce from training its CodeGen model.

So if the clause is widely adopted, it may be good for Microsoft and bad for Salesforce. If you want to reward Microsoft and punish Salesforce, it may be a good idea.

It shouldn't even hinge on a TOS.

If Microsoft loses this case, it actually means Microsoft wins and we all lose.

Who has a large enough corpus of training data? Only institutional copyright holders.

This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.

I would happily contribute all of my public source code under whatever license to a dataset that required models to also be open sourced. I am not OK with Microsoft creating a derivative work (Copilot's model) off of GPL code and not releasing the weights under the GPL.

I tend to agree, but to play devil's advocate, if we were talking about a biological neural network (person) training themselves by looking at GPL code, the GPL would of course not apply to code they release later in general.

> if we were talking about a biological neural network (person) training themselves by looking at GPL code...

The thing is, said person both reads far less code than a non-biological neural network and bases what it emits on many inputs besides the code it ingested via its high-resolution, multifocal, adaptive light sensors. These include, but are not limited to, experiences, communication with other biological neural networks, human-machine code translators (compilers), daily unpredictable hormone fluctuations, and countless other inputs that have affected every aspect of its cranial muscle and its daily living circumstances and choices.

IOW, a neural network is not a person, nor does it learn, derive, or emit the same way.

This is equivalent to claiming that a Furby is a person just because it can babble and blink.

I'm not saying all neural networks are people; I'm saying people are a subset of neural networks by definition. We don't have any idea how consciousness works, and our brains are essentially still black boxes.

In a similar way, no one really understands intuitively how these ML models actually work (we treat them as mostly black boxes in practice), in contrast to looking at an equation, for instance. I have played with some of these text generation models, and frankly we are already at the point where deciding whether or not they pass the Turing test depends on the details rather than the spirit of the rules for the test. It may not be a coincidence that NNs designed to replicate our own brain structure also replicate important aspects of our cognition.

These are not living beings though. They are programs. No one is arguing against humans learning.

> Neural networks approximate the function represented by your data.

The amount of data (not just code) that we would need to get sign off on is prohibitively large. If you account for all the stakeholders, then this won't be easy at all.

Meanwhile, the institutions will leap ahead of us. Models and annotated data sets will forever be out of reach. Open source equivalents will be severely behind the status quo.

Strongly disagree. Institutions have an advantage in having the data, but they do not have the capability of creating more.

An intentional open source dataset could target new domains that there was no institutional will to pursue. I strongly believe that the open source community's capabilities far exceed those of any single large corporation.

> This is probably going to play out like Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

What are you talking about here?

> I'm so worried about this case. Treating copyrighted training data as fair use, and letting models learn as a child might learn from a book or movie, is the best way to proceed. It widens the playing field for both development and disruption.

Which also opens the floodgates for GPL and other copylefted code to bleed into the proprietary realm. Very convenient, yes.

I’m never OK with someone making a derivation engine which offers my GPL code to a closed-source codebase.

> Oracle vs. Google when Google suddenly realized that they should lose and intentionally threw the case.

But Google won?

You're casting doubt on this with a frontal assault. I read your post and wanted to check out the 'show' that was implied, but my eyes immediately fell upon this:

Add our “Humans Only Clause” to your MIT license. Your code is still open source — for human developers only.

Sorely disappointed that there is no entertainment involved. That's actually a pretty cool idea.

So GitHub doesn't have (I could be wrong) a default license grant or an overriding licensing agreement. Your project, your license. If you change the license of your project, that is entirely your choice.

As to the question of whether we should be generous to our corporate masters or take this opportunity to stick it to the man and get rewarded for our mind products and compensated for being geeks: society does owe something, does it not? /G

It's worth having a discussion about it, imo.

Yeah, my content, but I don't necessarily have the ability to give them this, because people uploading code to GitHub don't always have the ability to grant them any license on the code uploaded.

The only way to add or read the license is to give them read and write access to all public repos in your GitHub account. Strange.

I can touch on this: the read and write access is so that future iterations can automatically add or append the license text to the file. Adding the licenses one by one is tedious.