Hacker News
by ghoward 1807 days ago
First, I did not come up with the term "code laundering." I cannot claim credit for that; I first saw it on HN, somewhere in https://news.ycombinator.com/item?id=27729209.

Second, you are correct that Copilot's maintainers claim that it bypasses copyright, but if it does while producing exact copies of code, then copyright is dead, and there are a lot of big companies out there with deep pockets that will ensure that doesn't happen.

They may claim that, because their algorithm is a black box, whatever it produces has no copyright, but my licenses will push back directly on that claim by saying that if source code under the license is used as all or part of the inputs to an algorithm, whether the source code is used in whole or in part, then the license terms must be attached to the output. After all, that's what we do with GPL and binary code. The binary code is the output of an algorithm (the compiler) whose input was the source code.

I hope by tying it together like that, the terms can close the loophole they are claiming. But of course, I am going to get a lawyer to help me with those licenses.


> ... if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output.

You're not getting it. If Copilot isn't currently infringing copyright then adding such a clause won't matter. Such a clause would only hold weight when copyright applies. On the other hand, if copyright does apply, then you don't need such a clause because the activity is already a violation of the vast majority of licenses. (It even violates extremely permissive ones because it effectively strips out the license notice.)

The GPL works specifically because copyright applies to the use case in question. It simply specifies various requirements that you must meet in order to license the code given that copyright applies.

In short, you can't just put a clause into a license saying, effectively, "and also, this license confers superpowers which make it so that my copyright applies in additional situations where it otherwise wouldn't!".

I think the GP's "license" would still be effective, although it would not be "open source" per the OSI definition.

Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?

I would argue not (in general). Copyright law as it stands does not permit a user who has access to a copy to do whatever they want with that copy (esp. if it involves more copying). OSS licenses do give you much freedom as long as you don't modify the code, and that's why we have the impression that we can do whatever we want with published source code. However, if we think about other types of copyrighted work, say movies for example, streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example: some photographers may allow you to freely browse their works, but they can still make you pay money if you want to use their photo in your commercial product.

So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.

(In practice, I don't think it will make much difference. I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aiming to reduce legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given how much money they have spent on this, and the outcome would be uncertain. If this were really tested in court, then given the result of the Oracle v Google case, where the US Supreme Court was apparently impressed by the social/economic benefits that Android brings, I'm pretty sure the justices would be even more impressed by this intelligent code generation thingy, and might just find it to be fair use.)

Your summary is generally correct, and I certainly agree with the other commenter's position on their work. But I think you're still missing the point. Copyright is the mechanism that allows you to prevent copying, but GitHub's claim is that copyright is irrelevant to Copilot's input.

I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.

Adding another deadbolt to the door does not help.

I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

Maybe I'm missing something (just not the thing you said), but has GitHub made any legal claims so far? The original article is written by a politician in the EU...

Even if you're a lawyer defending GitHub in this case, there are still a couple of things that need to be clarified before you can make the case: (maybe the info is out there but I'm too lazy to research)

- Is GitHub only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)

- As somebody pointed out in another thread, the GitHub terms of service agreement seems to grant GitHub additional rights when dealing with user uploaded content. Is that a legal basis for the use?

> I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

And I tend to agree with you (and the other commenter) here. But GitHub doesn't.

> has GitHub made any legal claims so far?

I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...

> Is GitHub only using code/repos that are explicitly under OSS licenses?

I don't think we know exactly what code they used as inputs, no.

Their argument defending the output side doesn't hold water, IMO. If Copilot produces exact copies verbatim, even some of the time, then as long as customers don't have access to the code used to generate the model, how can they be sure that a given suggestion isn't one of those copies?

It's a matter of scale. With a big enough codebase, there will be copyright violations.
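
To put rough numbers on the scale argument: if each suggestion independently has some small chance p of being a verbatim copy, then the chance that at least one of n accepted suggestions is a copy is 1 - (1 - p)^n. Below is a minimal Python sketch of that arithmetic. The function name, the independence assumption, and the value p = 0.001 are all illustrative assumptions, not figures from GitHub or from this thread.

```python
def at_least_one_copy(p: float, n: int) -> float:
    """Chance that at least one of n suggestions is a verbatim copy.

    Assumes (hypothetically) that each suggestion is independently a
    verbatim copy with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Complement of "none of the n suggestions is a copy".
    return 1.0 - (1.0 - p) ** n


# p = 0.001 is a made-up per-suggestion rate, used only for illustration.
for n in (100, 10_000, 1_000_000):
    print(n, at_least_one_copy(0.001, n))
```

Under these assumptions the result climbs toward 1 as n grows, which is the "big enough codebase" point in numeric form. Real suggestions are not independent, so treat this as intuition rather than an estimate.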

> I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

The point (that they claim that) you are missing is that if "copyright is relevant to Copilot's input" then almost all existing OSS licenses already don't allow that.

The licenses that I am making implicitly acknowledge the argument that training an ML model is fair use.

However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).

I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.

It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.

> I think your argument is still somewhat compelling, and some people will probably take your position.

I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.

> ... So why wouldn't copyright restrict usage of source code in similar situations?

I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.

You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)

You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.

For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!

Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (ie intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...

There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.

One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)

Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.

You have good points.

I argue that the output of an algorithm has the same copyright as the inputs to the algorithm, and that's because we use compilers (algorithms) to transform source code all the time already, and no one says that the binary code (outputs) is not copyrighted.

The trouble is there seems to be an entire continuum when it comes to degree of transformation.

The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)

For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.

Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.

For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.

Where does Copilot fall along this continuum?

For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.

But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?

Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.

Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.

Ah, I see.

I argue that, even if training on a dataset is fair use, distributing the result is copyright infringement. I would want my license to make that part clearer.

> even if training on a dataset is fair use, distributing the result is copyright infringement

I would be inclined to agree that the current situation (ie reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model is (or even should be) subject to the copyright of the inputs.

Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.

As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.

> I would want my license to make that part clearer.

But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.

You said that you're "not so sure that a trained model is (or even should be) subject to the copyright of the inputs."

You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?

I misspoke. (Err mistyped?) I suspect there will often be a stronger case to be made for the model itself falling under copyright than what it outputs. It's up to the courts and the legislature in the end though, so who knows.

Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)

> even if training on a dataset is fair use, distributing the result is copyright infringement

That seems an unlikely legal argument. It would defeat the point of fair use if you couldn’t distribute the result.

And no copyright license can override copyright law. Licenses can only grant rights, they can’t take them away.

Can you add fines?

I wish. I just want users to know what rights they have. Ultimately, I want my software to serve end users, not companies. If companies add value for users with my software, that's exactly what I want.

But stripping licenses away so that users can't know what rights they have with my code is not that.