Hacker News
by cqqxo4zV46cp 921 days ago
Yep. That is the question. Anyone that immediately comes back with “it’s stealing!”, especially those that were confidently saying it when this first became an issue, long before they would’ve had any time to contemplate it deeply, is just proving that techies’ sense of transferable expertise is completely unfounded.

No it isn’t, because ‘stealing’ is allowed.

There’s no question these neural networks and their output are derivative works. However, being a derivative work isn’t enough to guarantee copyright infringement.

So, the only question is if we are going to carve out an exception here or not. The idea someone can use a VCR to copy live TV and let people watch it later came out of a court case not copyright law. There’s a lot of such exceptions, but getting one isn’t guaranteed.

> There’s no question these neural networks and their output are derivative works.

In the two US cases we have any progress on so far, the established requirement for substantial similarity (as opposed to "dependent on" or such) has been upheld, with Judge Vince Chhabria specifically setting out that it'd "have to mean that if you put the Llama language model next to Sarah Silverman's book, you would say they're similar", and Judge William H. Orrick agreeing with the defendants that "plaintiffs cannot plausibly allege the Output Images are substantially similar or re-present protected aspects of copyrighted Training Images, especially in light of plaintiffs’ admission that Output Images are unlikely to look like the Training Images".

The UK definition of derivative works is, to my understanding, narrower and specifically enumerated as opposed to the US's more open-ended definition.

The remaining area of doubt, assuming the above remains consistent, is over the transient copying that occurs during training.

> the transient copying that occurs during training.

i think this should be dismissed as it is the same level of transience as the workings of the internet; you and your ISP, caching proxies etc, all made a transient copy as part of the existing (legal) consumption of the works that the author has put online.

Unless the work was illegally copied for training (which cannot be true if the work was publicly available for viewing on the internet), this transient copying cannot be a valid infringement.

Doing something a little isn’t the same as doing something a lot. You can walk into a restaurant and look at a menu for 5 minutes and then leave without issue, but try to do that same thing for 8 hours.

Downloading a single transient copy of some image once in the lifetime of a company is different than doing that same action a hundred times, once for each version of the network.

This case involves many examples of substantial similarity. Worse, it’s precedent that generative AI doesn’t necessarily avoid creating such examples.

Defendants can easily argue that being 1/10 millionth or whatever of the training set means their specific work is unlikely to show up in any specific example, but the underlying mechanism means it can be recreated.

The defendants will evidently claim transient copying.
I doubt these companies are constantly downloading the full training set rather than keeping it in a database somewhere.

Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.

> I doubt these companies are constantly downloading the full training set rather than keeping it in a database somewhere.

Precisely to argue for transient copies, they don't need to keep terabytes of data stored.

> Hard to argue keeping a copy of some copyrighted work indefinitely counts as transient.

You're assuming that they're keeping the works indefinitely, which again is not the case.

> Precisely to argue for transient copies, they don't need to keep terabytes of data stored.

Those kinds of legal workarounds rarely work.

They are dependent on persistent access, which gives them the equivalent benefit of keeping a persistent copy.

> There’s no question these neural networks and their output are derivative works.

A derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work).

There is absolutely no agreement that what neural networks do (as a rule) counts as such, so it is not at all correct to say "there is no question..."

If learning how to draw by watching other people draw makes everything you draw a derivative work, then perhaps you have a point.

The network in question recreated the exact content in question on a specific occasion. What happens in general isn’t the issue; the problem comes from specific output.

For a neural network to be able to recreate a complex work with minimal prompting, it must encode that information and therefore be a derivative work.

There are some ironclad exceptions but they would have to make it through the dysfunctional Congress.

The big one is recipes. Recipes under the current copyright regime in the US are considered non-copyrightable facts, which is why every cookbook and recipe blog has lots of copyrightable splash photos and personal anecdotes. Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.

> Congress specifically doesn’t want grandmas getting sued for copying the recipe on the box.

Recipes don't have a specific exception within the copyright law that Congress has carved out.

It is also not cut and dry. It basically boils down to facts not being copyrightable. So a list of ingredients and basic instructions (e.g. cooking time and temperature) won't be granted copyright protection.

But, the prose in the instructions can be copyrighted. So copying a whole recipe verbatim can be copyright infringement, but copying the list of ingredients and writing out the basic instructions is not.

Sounds like a job for LLMs - extract ingredients and steps, then verbalize it back in a completely different style.

But to what end? SEO-optimized recipe copy sites already exist, and they are so numerous that going to specific sites or books is now just a signal of reputability in a sea of trash.

not sure what Congress has to do with a case in the UK

fair use is mostly a US concept, there is no such thing in the UK or most other countries

It seems like UK and EU agree that you cannot copyright a recipe other than maybe the exact way it was written:

https://www.twobirds.com/en/insights/2020/uk/intellectual-pr...

https://www.copyright.eu/docs/protection-of-a-recipe/

Though you can patent novel methods of food production, which is also true in the US.

The root statement is still the same: legislatures can amend copyright laws as they wish if they really care. I don’t know that the UK parliament is exactly functioning well right now, but that’s my impression from across the pond.

> I don’t know that the UK parliament is exactly functioning well right now

in terms of ability to legislate it works considerably better than the US congress

up to you if you call that well functioning

You can only copyright the actual expression of a recipe as a literary work, but the functional aspect, the cake let's say, isn't copyrightable.

> There’s no question these neural networks and their output are derivative works.

Most generated content almost certainly isn’t derivative work by the standards of copyright law. It’s plainly obvious to anybody who’s read Frank Herbert’s books that he derived a lot of ideas from Isaac Asimov, but it’s equally obvious that Dune isn’t a derivative work of Foundation.

If I had some commercial interest in generative AI models, I’d be very happy that everybody is debating the copyright implications, because copyright law is certainly going to favour the models. The biggest regulatory risk to them as far as I can tell is that they clearly don’t have Section 230 protections, and I can’t imagine how that isn’t going to come crashing down around them rather soon.

If you run someone over you can’t defend yourself by saying 99.999% of the time you didn’t run someone over. Most output being free of copyright issues isn’t a defense if any output has those issues.

Specific examples of clear copyright infringement mean that output is a derivative work AND, by encoding enough information to recreate it, the underlying neural network must itself be a derivative work.

Derivative work has a specific meaning in copyright law: there has to be something of the original in the output, and that's not the case here. Otherwise every single owner of the 5 billion training images could sue you over your "cat at a cafe" Midjourney picture.

Judge Orrick in one of the US cases already called this idea "nonsense", his words.

Not all outputs are at issue here, but if ANY output is copyright infringement they have problems.

Specific and clear examples of derivative works have been shown; therefore both those exact examples and the underlying neural network must be derivative works.