by hnfong 1807 days ago
I think the GP's "license" would still be effective, although it would not be "open source" per the OSI definition.

Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?

I would argue not (in general). Copyright law as it stands does not permit a user who has access to a copy to do whatever they want with that copy (esp. if it involves more copying). OSS licenses do give you much freedom as long as you don't modify the code, and that's why we have the impression that we can do whatever we want with published source code. However, if we think about other types of copyrighted work, say movies for example, streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example: some photographers may allow you to freely browse their works, but they can still make you pay if you want to use their photos in your commercial product.

So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.

(In practice, I don't think it will make much difference -- I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aiming to reduce legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given how much money they have spent on this, and the outcome will be uncertain. If this were really tested in court, given the result of the Oracle v Google case, if the US Supreme Court was impressed by the social/economic benefits that Android brings, I'm pretty sure the justices will be even more impressed by this intelligent code generation thingy, and might just rule that it is fair use.)


Your summary is generally correct, and I certainly agree with the other commenter's position on their work. But I think you're still missing the point. Copyright is the mechanism that allows you to prevent copying, but GitHub's claim is that copyright is irrelevant to Copilot's input.

I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.

Adding another deadbolt to the door does not help.

I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

Maybe I'm missing something (just not the thing you said), but has Github made any legal claims so far? The original article is written by a politician in the EU...

Even if you're a lawyer defending Github in this case, there are still a couple of things that need to be clarified before you can make the case: (maybe the info is out there but I'm too lazy to research)

- Is Github only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)

- As somebody pointed out in another thread, the Github terms of service agreement seems to grant Github additional rights when dealing with user uploaded content. Is that a legal basis for the use?

> I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

And I tend to agree with you (and the other commenter) here. But GitHub doesn't.

> has Github made any legal claims so far?

I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...

> Is Github only using code/repos that are explicitly under OSS licenses?

I don't think we know exactly what code they used as inputs, no.

Their argument defending the output side doesn't hold water, IMO. If Copilot produces exact copies verbatim, even some of the time, then as long as customers don't have access to the code used to generate the model, how can they be sure that a given suggestion isn't one of those copies?

It's a matter of scale. With a big enough codebase, there will be copyright violations.

> I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

The point (that they claim that) you are missing is that if "copyright is relevant to Copilot's input" then almost all existing OSS licenses already don't allow that.

The licenses that I am making implicitly acknowledge the argument that training an ML model is fair use.

However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).

I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.

It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.

> I think your argument is still somewhat compelling, and some people will probably take your position.

I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.

> ... So why wouldn't copyright restrict usage of source code in similar situations?

I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.

You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)

You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.

For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!

Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (i.e., intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...

There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.

One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)

Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.

You have good points.

I argue that the output of an algorithm has the same copyright as the inputs to the algorithm, and that's because we use compilers (algorithms) to transform source code all the time already, and no one says that the binary code (outputs) is not copyrighted.

The trouble is there seems to be an entire continuum when it comes to degree of transformation.

The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)

For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.

Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.

For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.

Where does Copilot fall along this continuum?

For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.

But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?

Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.

Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.