by d110af5ccf 1807 days ago
> I think your argument is still somewhat compelling, and some people will probably take your position.

I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.

> ... So why wouldn't copyright restrict usage of source code in similar situations?

I'm certainly not an expert here, but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you make them out to be.

You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)

You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.

For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!

Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (i.e., intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes, then they can't do that. But assuming they never promised you that in the first place ...

There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.

One possible counterargument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless), it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)

Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.

1 comment

You have good points.

I'd argue that the output of an algorithm carries the same copyright as the inputs to the algorithm. We already use compilers (algorithms) to transform source code all the time, and no one says that the binary code (the output) is not copyrighted.

The trouble is there seems to be an entire continuum when it comes to degree of transformation.

The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)
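To make the "same meaning, different information" point concrete, here is a minimal sketch. It uses Python's built-in bytecode compiler as a stand-in for the C-to-machine-code case (that substitution is my assumption, as are the `add` function and the comment text, which are made up for illustration). It only shows that compilation keeps behavior, drops comments, and adds some bookkeeping; it says nothing about the legal question.

```python
# Hypothetical source text, used only for illustration.
SOURCE = '''
# Add two numbers (this comment does not survive compilation).
def add(first, second):
    total = first + second
    return total
'''

# Compile the source text into a code object (Python bytecode).
module_code = compile(SOURCE, "<example>", "exec")

# Execute the compiled module to obtain the function it defines.
namespace = {}
exec(module_code, namespace)
add = namespace["add"]

# The "meaning" is preserved: the compiled function still adds.
result = add(2, 3)

function_code = add.__code__

# Information is removed: the comment text is not among the constants
# stored in the compiled function.
kept_strings = [c for c in function_code.co_consts if isinstance(c, str)]
comment_survives = any("does not survive" in s for s in kept_strings)

# Some information is carried over as metadata: argument and local names.
local_names = function_code.co_varnames

# Some information is added: the source never mentions a stack size,
# but the compiler computes and records one.
stack_size = function_code.co_stacksize

print(result, comment_survives, local_names, stack_size)
```

An optimizing C compiler typically discards far more than this (local variable names, for instance, unless debug info is kept), so treat the sketch as the mild end of the same idea.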

For something notably muddier than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input, but you can often spot various strong resemblances.

Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.

For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.

Where does Copilot fall along this continuum?

For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.

But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?

Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.

Again, my intention here isn't to argue a particular side. I'm just trying to make clear how complicated this stuff is and that we don't have clear legal answers for most of it yet.