
by kmeisthax 1245 days ago

My biggest critique of OpenRAIL is that it's not entirely clear that AI is copyrightable[0] to begin with. Specifically the model weights are just a mechanical derivation of training set data. Putting aside the "does it infringe[1]" question, there is zero creativity in the training process. All the creativity is either in the source images or the training code. AI companies scrape source images off the Internet without permission, so they cannot use the source images to enforce OpenRAIL. And while they would own the training code, nobody is releasing training code[2], so OpenRAIL wouldn't apply there.

So I do not understand how the resulting model weights are a subject of copyright at all, given that the US has firmly rejected the concept of "sweat of the brow" as a copyrightability standard. Maybe in the EU you could claim database rights over the training set you collected. But the US refuses to enforce those either.

[0] I'm not talking about "is AI art copyrightable" - my personal argument would be that the user feeding it prompts or specifying inpainting masks is enough human involvement to make it copyrightable.

The Copyright Office's refusal to register AI-generated works has been, so far, purely limited to people trying to claim Midjourney as a coauthor. They are not looking over your work with a fine-toothed comb and rejecting any submissions that have badly-painted hands.

[1] I personally think AI training is fair use, but a court will need to decide that. Furthermore, fair use training would not include fair use for selling access to the AI or its output.

[2] The few bits of training code I can find are all licensed under OSI/FSF approved licenses or using libraries under such licenses.

5 comments

nickvincent 1245 days ago

This is a great point.

Not a lawyer, but as I understand it, the most likely way this question will be answered (for practical purposes in the US) is via the ongoing lawsuits against GitHub Copilot, Stable Diffusion, and Midjourney.

I personally agree the creativity is in the source images and the training code. But unless it is decided that, for legal purposes, "AI Artifacts" (the files containing model weights, embeddings, etc.) are just transformations of training data, and therefore content subject to the same legal standards as content, I see a lot of value in trying to let people license training data, code, and models separately. And if models are just transformations of content, I expect we can adjust the norms around licensing to achieve similar outcomes (i.e., trying to balance open sharing with some degree of creator-defined use restriction).

nl 1245 days ago

The Copilot and DALL-E lawsuits aren't about whether the weights file can be copyrighted, though (they are about whether people's work can be freely used for training).

This is a different issue where the OP is arguing that the weights file is not eligible for copyright in the US. That's an interesting and separate point which I haven't really seen addressed before.

topynate 1244 days ago

The two issues aren't exactly the same but they do seem intimately connected. When you consider what's involved in generating a weights file, it's a mostly mechanical process. You write a model, gather some data, and then train. Maybe the design of the model is patentable, or the model/training code is copyrightable (actually, I'm pretty sure it is), but the training process itself is just the execution of a program on some data. You can argue that what that program is doing is simply compiling a collection of facts, which means you haven't created a derivative work, but in that case the weights file is a database, by definition, so not copyrightable in the US. Or you can argue that the program is a tool which you're using to create a new copyrightable work. But in that case it's probably a derivative work.

nickvincent 1244 days ago

Appreciate the distinction in the above comment that they are two distinct questions, but also agree the two questions are very connected.

I should've been more specific: I was thinking mainly of the artists v. Stable Diffusion lawsuit, which makes the specific technical claim that the Stable Diffusion software (which includes a bunch of "weights files") includes compressed copies of the training data. (Line 17, "By training Stable Diffusion on the Training Images, Stability caused those images to be stored at and incorporated into Stable Diffusion as compressed copies", https://stablediffusionlitigation.com/pdf/00201/1-1-stable-d...).

I expect that if the decision hinges on this claim, that could have far-reaching implications re: model licensing. I think this is along the lines of what you've laid out here!

twoodfin 1245 days ago

How would you distinguish “just a mechanical derivation of training set data” from compiled binary software? The latter seems also to be a mechanical derivation from the source code, but inherits the same protections under copyright law.

kmeisthax 1245 days ago

Usually binaries are compiled from your own source code. If I took leaked Windows NT kernel source and compiled it myself, I wouldn't be able to claim ownership over the binaries.

Likewise, if I drew my own art and used it as sample data for a completely trained-from-scratch art generator, I would own the result. The key problem is that, because AI companies are not licensing their data, there isn't any creativity that they own for them to assert copyright over. Even if AI training itself is fair use, they still own nothing.

taneq 1245 days ago

Do artists not own copyright on artwork which comprises other sources (e.g. collage, sampled music)? It’d be hard to claim that e.g. Daft Punk doesn’t own copyright on their music.

(Whether other artists can claim copyright over some recognisable sample is another question.)

kmeisthax 1245 days ago

This is why there's the "thin copyright" doctrine in the US. It comes up often in music cases, since a lot of pop music is trying to do the same thing. You can take a bunch of uncopyrightable elements, mix them together in a creative way, and get copyright over that. But that's a very "thin" copyright, since there is less creativity involved.

I don't think thin copyright would apply to AI model weights, since those are trained entirely by an automated process. Hyperparameters are selected primarily for functionality and not creative merit. And the actual model architectures themselves would be the subject of patents, not copyright, since they're ideas, not expressions of an idea.

Related note: have we seen someone try to patent-troll AI yet?

nl 1245 days ago

It depends.

The Verve's Richard Ashcroft lost partial copyright and all royalties for "Bitter Sweet Symphony" because a sample from the Rolling Stones wasn't properly cleared: https://en.m.wikipedia.org/wiki/Bitter_Sweet_Symphony

Men at Work lost a copyright case over their famous "Down Under" because it used the tune from "Kookaburra Sits in the Old Gum Tree" in its flute riff.

rnd0 1244 days ago

>Do artists not own copyright on artwork which comprises other sources (e.g. collage, sampled music)? It’d be hard to claim that e.g. Daft Punk doesn’t own copyright on their music.

Agreed. By that logic, William S Burroughs wouldn't own his best novels: https://en.wikipedia.org/wiki/Cut-up_technique

taneq 1245 days ago

“Mechanical derivation” is doing a lot of heavy lifting here. What qualifies something as “mechanical”? Any algorithm? Or just digital algorithms? Any process entirely governed by the laws of physics?

kmeisthax 1244 days ago

So, in the US, the bedrock of copyrightability is creativity. The opposite would be what SCOTUS derided as the "sweat of the brow" doctrine, where merely "working hard" would give you copyright over the result. No court in the US will actually accept a sweat of the brow argument, of course, because there's Supreme Court precedent against it.

This is why you can't copyright maps[0], and why scans of public domain artwork are automatically public domain[1][2]. Because there's no creativity in them.

The courts do not oppose the use of algorithms or mechanical tools in art. If I draw something in Photoshop, I still own it. Using, say, a blur or contrast filter does not reduce the creativity of the underlying art, because there's still an artist deciding what filters to use, how to control them, et cetera.

That doesn't apply to AI training. The controls that we do have for AI are hyperparameters and training set data. Hyperparameters are not themselves creative inputs; they are selected by trial and error to get the best result. And training set data can be creative, but the specific AI we are talking about was trained purely on scraped images from the Internet, which the creator does not own. So you have a machine that is being fed no creativity, and thus will produce no creativity, so the courts will reject claims to ownership over it.

[0] Trap streets ARE copyrightable, though. This is why you'll find fake streets that don't exist on your maps sometimes.

[1] https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel....

[2] Several museums continue to argue the opposite - i.e. that scanning a public domain work creates a new copyright on the scan. They even tried to harass the Wikimedia Foundation over it: https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...

cwkoss 1245 days ago

Is the choice of what to train upon not creative? I feel like it can be.

kmeisthax 1244 days ago

Possibly, but even if that were the case, it would protect NovelAI, not Stability.

The closest analogue I can think of would be copyrighting a Magic: The Gathering deck. Robert Hovden did that[0], and somehow convinced the Copyright Office to go along with it. As far as I can tell this never actually got court-tested, though. You can get a thin copyright on arrangements of other works you don't own, but a critical wrinkle in that is that an MTG deck is not merely "an arrangement of aesthetically pleasing card art". The cards are picked because of their gameplay value, specifically to min-max a particular win condition. They are not arrangements, but strategies.

Here's the thing: there is no copyright in game rules[1]. Those are ideas, which you have to patent[2]. And to the extent that an idea and an expression of that idea are inseparable, the idea part makes the whole uncopyrightable. This is known as the merger doctrine. So you can't copyright an MTG deck, since that would give you de-facto ownership over a particular game strategy.

So, applying that logic back to the training set, you'd only have ownership inasmuch as your training set was selected for a particular artistic result, and not just "reducing the loss function" or "scoring higher on a double-blind image preference test".

As far as I'm aware, there are companies that do creatively select training set inputs; e.g. NovelAI. However, most of the "generalist" AI art generators, such as Stable Diffusion, Craiyon, or DALL-E, were trained on crawled data with little, if any, tweaking of the inputs[3]. A lot of them have overfit text prompts, because the people training them didn't even filter for duplicate images. You can also specifically fine-tune an existing model to achieve a particular result, which would be a creative process if you could demonstrate that you picked all the images yourself.

But all of that only applies to the training set list itself; the actual training is still noncreative. The creativity has to flow through to the trained model. There's one problem with that, though: if it turns out that AI training for art generators is not fair use, then your copyright over the model dissolves like cotton candy in water. This is because without a fair use argument, the model is just a derivative work of the training set images, and you do not own unlicensed derivative works[4].

[0] https://pluralistic.net/2021/08/14/angels-and-demons/#owning...

[1] Which is also why Cory Doctorow thinks the D&D OGL (either version) is a water sandwich that just takes away your fair use rights.

[2] WotC actually did patent specific parts of MTG, like turning cards to indicate that they've been used up that turn.

[3] I may have posted another comment in this thread claiming that training sets are kept hidden. I had a brain fart, they all pull from LAION and Common Crawl.

[4] This is also why people sell T-shirts with stolen fanart on them. The artists who drew the stolen art own nothing and cannot sue. The original creator of that art can sue, but more often than not they don't.

kaoD 1245 days ago

> nobody is releasing training code

Interesting. Why is this happening?