| HN Mirror

by rickmode 1063 days ago

I believe we first need to answer the question of whether the copyright of the AI model’s source text or images affects the output.

My opinion — and note I’m a software engineer, not a lawyer — is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material. This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation. And further, a user of the AI would themselves require a license to use the output.

The alternative seems to be “anything goes”.

23 comments

Nevermark 1063 days ago

I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

A model trained on several copyrighted data sources cannot somehow be used in a way depending on a subset of those sources.

So all parameters of usage and compensation should be settled by contract between the model builder and copyrighted data supplier, before the copyrighted material is used.

Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

That’s it. That’s the standard. No complicated new laws required.

Model builders obtain permission to use copyrighted material from copyright holders based on any terms both agree to.

Terms might involve model usage limits, term limits, one time compensation, per use compensation, data source credits, or anything else either party wants.

The likely result will be some standard sets of terms becoming popular and well known. But nobody has to agree to anything they don’t want to.

kuchenbecker 1063 days ago

I slightly disagree, in that I think the person using the tool should bear the burden of copyright. I.e. if the model outputs something under copyright, it merely can't be republished. In the same way, I can use Photoshop on proprietary data but I can't necessarily sell the results.

dumpsterdiver 1063 days ago

I'm so torn. On one hand, what you suggest seems to be a nearly ideal balance between advancing scientific progress and legal liability. By placing the legal burden to publish generated works on the person actually trying to publish, it allows for a more nuanced legal approach (i.e. the difference between "there are similarities to this work, but it's murky" and "you 100% stole that work").

On the other hand, is the company running the model themselves not already publishing all of that work and profiting from it? It seems unfair that their bottom line gets to be bolstered because they can produce work based on any artist, whereas the consumers of that work may end up walking on eggshells in order to publish it.

Like I said, I'm torn as far as how it "should be". I know how I want it to be though. I would love if AI continued training unabated. The results have been amazing, and I believe it would be a shame if the effort was slowed down by legislation.

chii 1062 days ago

> is the company running the model themselves not already publishing all of that work and profiting from it?

no, because the model is transformative enough that it cannot be said to be a derivative work of the training set.

The model is in essence a form of distilled information, extracted from the training set. Information cannot be copyrighted - only expressions can.

Therefore, a model producer should have the right to use any pre-existing work, in the same way a person can, to study and internally memorize and extract information.

The reproduction of any of the training set data constitutes a copyright violation, but this is not done by the owner of the model, but by an end user of the model.

dumpsterdiver 1050 days ago

My point is that if a court finds that a generated image is indeed similar enough to constitute an infringement when a subscriber of, for instance, MidJourney attempts to publish it, has that work not already been "published" to the subscriber? And has MidJourney not profited by gaining a subscriber based on the work of others?

haldujai 1063 days ago

I wonder if that analogy represents the same thing. Speaking purely from a non-legal perspective on the ethics in my mind:

When you use Photoshop on proprietary data you're providing the original data and choosing what manipulation to make (i.e. what tool) and directly creating the output. It makes sense that if you redistribute this it may be a copyright violation.

When you use Copilot or ChatGPT for programming you're typically asking a non-proprietary question or accepting suggestions it's making based on non-proprietary (or proprietary to you) code in the file. You also don't dictate the manipulation process a black box deep learning model does (i.e. I haven't asked it to do something that could be reasonably thought to be a copyright violation).

Am I then responsible for the fact that Copilot is fooling me with effectively copy-pasted copyrighted code when it's being presented to me as generated by the software and I haven't instructed the software to commit a copyright violation? I'm not sure if intent matters for copyright, I assume it doesn't but perhaps that's a missing piece to this.

Diffusion models are gray to me. If you're asking/prompting with "Mickey Mouse riding a horse", I can see the argument that the prompt itself can be interpreted as asking the model to commit a copyright violation and the user is just hiding behind a layer of abstraction. If I ask the model to spit out "a picture of a smiling cartoon woman" and it generates a Betty Boop lookalike, is that still the user's fault?

It seems to me like passing the burden to the user could be reasonable but would need some safe harbor type of exception. It'll be really interesting to see what the courts decide.

8n4vidtmkvmk 1063 days ago

I see 2 problems with that.

(1) how do you know if the image that just generated is substantially similar to an existing copyrighted work? Maybe if some registration tool existed, but otherwise the burden is too great

(2) what is stopping someone from generating millions of images and copyrighting all the "unique" ones? Such that no one can create anything without accidental collisions.

gwd 1063 days ago

> how do you know if the image that just generated is substantially similar to an existing copyrighted work?

This is already a problem with biological neural nets (i.e. humans). I remember as a teenager writing a simple song on the piano, and playing it for my mom; she said, "You didn't write that -- that's Gilligan's Island!" And indeed it was. If I had made a record and sold it, whoever owned the rights to the Gilligan's Island theme song could have sued me for it, and they would (rightly) have won.

There's already loads of case law about this; the same thing would apply to AI.

> what is stopping someone from generating millions of images and copyrighting all the "unique" ones? Such that no one can create anything without accidental collisions.

Right now what's stopping it is that only humans can make copyrightable material; whatever is spat out from a computer is effectively public domain, not copyrighted.

jasrys 1063 days ago

1. lots of established law and case law (at least in the US), this is already a well-settled problem and folks have the tools and proper venue to bring infringement claims. Yes, federal copyright infringement litigation is prohibitively expensive for many issues. There is now a "small claims court" for smaller issues. [1]

2. Those works cannot be copyrighted (at least in the US). [2]. And hey, someone already tried copyrighting every song melody [3]

[1]: https://copyright.gov/about/small-claims/

[2]: https://www.federalregister.gov/documents/2023/03/16/2023-05...

[3]: https://www.youtube.com/watch?v=sJtm0MoOgiU

Nevermark 1063 days ago

But that problem is already solved.

Copyright holders are already protected from (I.e. can legally prohibit) distribution of obvious copies, or clearly derivative works.

Regardless of how they were produced: by hand, copy machine, Photoshop or with a model.

The new problem is that artists’ styles are being “stolen” by incorporating their copyrighted work into models without their permission.

And that problem can easily be solved if using copyrighted material to create models is declared NOT fair use.

Artists could still allow models to be built from their work, but on their terms. If they wish to do that.

A famous artist who doesn’t mind being commercial could sell their own unique model to let fans create art in that artist’s style, while not having their style “ripped” by others.

Or just keep their style to themselves, for their own work, as artists have done for centuries.

(Of course, with greater effort, their style could still be recreated - styles are not protected unless they are trademarked - but the recreation would have to be done without using the artist’s copyrighted works.)

jasrys 1063 days ago

This is probably a somewhat unpopular opinion on HN, but it is where many of the artists I work with are generally trying to get to. Consent, compensation, and credit.

Nevermark 1062 days ago

> Consent, compensation, and credit.

I just want to quote you. Nothing I need to say. That’s it.

readyplayeremma 1063 days ago

This is the best path forward I think. And it will become increasingly sensible as things continue to evolve. AI wasn't necessary to violate copyright before, and it isn't necessary today.

The determination of copyright violation should be made against the output of the model in the event that someone uses it for commercial purposes.

If the models have a risk of generating copyrighted content, it will be up to the consumers of the system to mitigate that risk through manual review or automated checks of the output.
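As a rough illustration of what such an automated check might look like, here is a minimal Python sketch that flags generated text sharing long word sequences with known reference texts. The function names, the n-gram size, and the threshold are all assumptions made up for this example; this is not an established tool and certainly not a legal test for infringement.

```python
def ngrams(text, n=5):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated, reference, n=5):
    """Fraction of the generated text's n-grams that also appear in reference."""
    gen = ngrams(generated, n)
    if not gen:
        # Text shorter than n words has no n-grams to compare.
        return 0.0
    return len(gen & ngrams(reference, n)) / len(gen)


def flag_for_review(generated, references, n=5, threshold=0.5):
    """Return indices of references whose overlap meets the (arbitrary) threshold."""
    return [i for i, ref in enumerate(references)
            if overlap_ratio(generated, ref, n) >= threshold]
```

A check like this only catches near-verbatim text reuse. Anything it flags would still need the manual review mentioned above, and it says nothing about images or about stylistic similarity.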

xorbax 1063 days ago

A divergence, but I see a lot of posters asserting that "humans learn by copying other people, but we don't call that a violation of copyright when they draw"

People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?

readyplayeremma 1063 days ago

> People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

> If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?

Software equivalence to humanity is a very philosophical question that many sci-fi writers have approached. But our primary issue related to this technology does not depend on anyone making a determination there.

The challenge is that losses to livelihood from this technology are going to come from far broader impacts than copyright alone. Copyright disputes are just the first things to get everyone's attention.

Let's say we err on the side of protection of copyright, and all training data must be fully licensed, in addition to users being responsible for ensuring outputs did not accidentally reproduce something similar to a copyrighted work, even if it was part of the licensed training dataset. Great! This fixes the problem of lost value for the owners of copyrights. Companies will face a slight delay and slightly increased costs as they license content; however, in the end, model capabilities will be the same and continue to increase.

The number of jobs that actually cannot be performed without humans will continue to dwindle — livelihoods will be lost at essentially the same scale despite upholding copyrights.

The only way we can handle a technology capable of reducing most need for human labor is by focusing on planning and executing a smooth transition toward an economy with more people than jobs — aiming for minimal human suffering during this process.

A mass loss of human jobs does not need to mean a mass loss of livelihood if our society is prepared to transition to a universal basic income. After all, human life is far more than just a job. We have the opportunity for much more fulfilling lives if we plan this transition well. We must understand that this is a far larger issue than copyright - copyright disputes are just one of the first symptoms of this disruptive process.

JamesBarney 1063 days ago

A human is still entering the prompt to generate the possibly copyrighted image/text. I don't think copyright law should care about the implementation. It's ok to copy a style if you use paint brushes or Photoshop. But not ok if you use a statistical model?

LegitShady 1063 days ago

Apply for a copyright on your human authored prompt then. That's the extent of human authorship.

gwd 1063 days ago

> Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

The more I think about it, the more something along these lines seems like it might be the right way to think about it.

When you play a DVD, for example, you copy the bits off the DVD, into the memory of your DVD player, and onto your screen; this is all explicitly considered "fair use" copying. But if you then copied those fair-use bits off the screen onto a thousand other screens, that violates copyright.

When you, as the human watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

We could make the law for LLMs follow a similar logic: That having an LLM watch a video or read a text is similar to having a DVD player read a DVD or a web browser copy information from a website. It's good for that limited use case, but the resulting copy cannot be copied again without a license.

This would allow (say) researchers, or even individuals, to do their own training and so on without a license; but when anyone wanted to create something that they wanted to scale up, they'd have to get licenses for everything.

That would fundamentally keep things balanced as they are now between creators and other creators. The big problem isn't that a handful of other creators may be copying their style; that growth in competition is limited by the expense of duplication. It's that millions of electronic engines can copy their style.

judge2020 1062 days ago

> When you, as the human watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

If you ripped The Little Mermaid, redrew every frame to combine it with The Fresh Prince of Bel-Air and moved things around in scenes to make it look like Ariel is Will Smith responding to sitcom dialogue, then it'd be fair use, regardless of how many people you show this new version to.

Fair use isn't about how or why you're doing something. The definitions for fair use are very clearly laid out at https://www.law.cornell.edu/uscode/text/17/107

kelnos 1063 days ago

> I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

I'm torn on who should pay, and where and when. In the world of patents, there's often an option/split. Say a chip manufacturer wants to build H265 decoding into their hardware. The chip manufacturer could buy the license. Or the purchaser (who probably is building some sort of board or device around the chip) could pay for the license. Or they could disable that functionality in the end product, and the consumer could pay for a license (or not, if they don't care about that feature).

The most common is usually the middle option: the end-device manufacturer (or brand that eventually sells the product) will pay for the license.

But I'm not sure if this works all that well for an AI model. With hardware, the license is usually paid per unit. It's easy to see that one chip = one license. If the model builder buys a license, that model could be used one time or 100 million times. Tracking use like that probably isn't all that practical, but I think it's safe to say that a 100-million-use model should probably pay more for a license than a single-use model.

So maybe the model builder should be responsible for attaching a comprehensive "copyright history" to the model, and users should have to pay for a license based on their use? Again, not sure how to track that. But I guess general software licensing has similar problems when you can "hide" usage.

Retric 1063 days ago

Yes, someone using a model can’t know if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize. If use of the output of these systems comes at significant legal risk then such systems become nearly useless.

chii 1063 days ago

> if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize

how does the industry today deal with artists that "copy" off some other works? This isn't a problem with AI at all - just that AI provides a tool to generate such works faster.

skydhash 1063 days ago

Someone comes to me to ask for a drawing of Batman or to write an erotic story around Supergirl. I can do it, but I cannot claim ownership over the characters. And I think I will quickly get a letter from DC or Marvel if I try to do this at scale.

chii 1063 days ago

> I can do it, but I cannot claim ownership over the characters.

of course not. But you can claim ownership if you don't call those characters their original names, and make sufficient changes to the design (how sufficient is determined by a court of law - thus expenses).

> DC or Marvel if I try to do this at scale.

The show 'Invincible'[1] has a character that is a basic copy of Superman. And yet, you will find that they don't get a letter from DC.

[1] https://en.wikipedia.org/wiki/Invincible_(TV_series)

skydhash 1063 days ago

> make sufficient changes to the design

I think that’s one of the issues. The transformation done by these tools is mechanical, even if it may be extensive. The human input is too small. Omni-Man may have similarities with Superman, but he is not him in the larger context of the story. LLMs cannot yet be that consistent for marketable output that deserves to be copyrightable.

I’m perfectly fine with LLMs aiding with spell checking and alternative phrasing (image is a grayer area). But the idea of prompts and prompt output being copyrightable is something I oppose.

Retric 1063 days ago

The difference is the artist’s assertion that it’s either original or a copy from something else. DALLE 2 can’t tell you if it’s original or not. These AIs have no idea and the company or group that created them doesn’t review individual output so they can’t say either.

chii 1063 days ago

> DALLE 2 can’t tell you if it’s original or not

whoever pressed the button to run DALLE will make the assertion, just like whoever was running Photoshop to make the image today would make the same assertion.

Retric 1063 days ago

Based on what?

A Photoshop user controls what data Photoshop uses, a DALLE user doesn’t. Even a prompt as generic as “Cat” could be producing an obviously derivative work if you compare it to the original. This is true for all prompts.

mjan22640 1063 days ago

The generated content is a derivative work of each piece of the material the model was trained on. That material can be listed.

Retric 1063 days ago

So your suggestion is to list hundreds of millions of works and have users manually review them? I don’t think that’s going to work.

renonce 1063 days ago

Problem is, how can you determine if the model contains copyrighted material? The law governs copyright through ownership, so in order to claim copyright infringement you have to be able to pinpoint a specific person and prove that their work is somehow embedded in the gradients, which is not practically possible at this point. It's just like how you can't practically enforce copyright on encrypted data unless you ban encryption altogether.

haldujai 1063 days ago

1. If you know your copyrighted material was in the training dataset is that not sufficient?

2. From a legal perspective do you actually have to prove it's embedded in the gradients? If I draw an exact copy of Mickey Mouse from memory and sell it I didn't think Disney had to prove I've ever actually seen Mickey Mouse before or point to where the image of him is embedded in my brain.

cwkoss 1063 days ago

Disney has a trademark on Mickey Mouse, but that does not mean that they automatically get copyright on all pictures of Mickey Mouse drawn by others (they don't).

haldujai 1063 days ago

Bad example on my part in that case. I thought some art is copyrighted or am I mistaken? If so replace Mickey Mouse with something copyrighted

meowkit 1063 days ago

My opinion as a SWE who is dating a lawyer (joke, not a serious qualification but it does provide some insight):

Generative models traverse and interpolate high dimensional state spaces. These state spaces are created from input data.

I would argue people do the exact same thing - the first main difference is we can use novel inputs (e.g. we can use images or words to develop our music/temporal state spaces and vice versa). People also are recursive and self referential in a way that doesn't collapse.

Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution. Either traditional copyright wins and we get even more draconian policies (think Disney and their desire to never put anything in the public domain), or we have a free for all (which I don't think is bad for creative works, but certainly for more practical things like stock photos or nonfiction).

cj 1063 days ago

I can appreciate how this line of thinking might be attractive.

But IMO the human<>machine comparison doesn't lend itself much credence. We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too. I think some care should be taken when considering if we allow machines to have the same privileges as humans.

hexage1814 1063 days ago

> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too

There are no sentient machines (at least yet). Your position is one where you are actually limiting what other humans can do, limiting which tools other humans can have access to. Also, the parameter – according to the law – was always "the same". For instance, there is nothing preventing you from making your own chess league where computers are allowed to compete. FIDE is free to ban you from competing in their leagues or to ban anyone associated with your league or whatever, but there is nothing in the law preventing you.

I have been saying this from day one: this whole debate is mainly white-collar workers negatively impacted by automation making up any excuse they can to say why their job should be protected, somehow, for some reason, but not the one of coal miners or what have you.

A human downloads a photo to learn how to draw. Another human downloads a photo to teach their computer how to draw. No difference, no need to obtain any license in any of the cases.

raincole 1063 days ago

> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too.

Generally speaking, even if one machine is allowed to do something, it doesn't automatically mean another machine is allowed to do that.

For example you can drive a car with a normal driving license, but not a truck. In some states you can own a pistol but no automatic rifle.

hexage1814 1063 days ago

It also depends on where this is happening. For instance, you don't need a license to drive a car inside your own private property. You need a license to drive it on public streets because society needs some assurance that you know what you are doing. So in many cases the laws and restrictions also happen in relation to a given scenario.

qaq 1063 days ago

copyright exists among other things to "promote the progress of science and useful arts".

freejazz 1063 days ago

That section is written in parallel verse, with copyright <> science, and patent <> useful arts. This sounds weird, now, but it's consistent with the use of the words at the time, which is the reverse of how they are used today, where paintings etc are considered art, and inventions are considered science. So, it's not that copyright exists to promote science and art (as we call them today) but only just the arts. Patents are for science. Authorship reflects copyright and invention reflects patent:

> Congress shall have the power... To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.

Paradigma11 1063 days ago

A machine is just a tool. It is the creator and the user of the machine that has the privileges he uses the tool with. I think we should be careful not to anthropomorphize, attribute agency, responsibility and autonomy to something that is essentially a better Photoshop plugin.

klabb3 1063 days ago

I don’t think the parent is anthropomorphizing anything. The ones who anthropomorphize are saying that machines should be covered by fair use, because they have similarities with humans.

This is not about the rights of a machine but about how one human product is consumed by another human product. This is just a commercial supply chain: if you make a model, you need human data. You generally need to compensate your suppliers of “raw material”.

Paradigma11 1063 days ago

It's not the tool that is covered by fair use. It is the creation of the tool that is covered by fair use.

Is the tool itself supposed to be a copyright violation or is it a tool facilitating copyright violation by producing violating output?

The latter is something that can be tested because we have processes to compare works of art for it. If it is shown that LLMs produce mostly infringing art then we can and should ban or heavily regulate them. If not then not.

klabb3 1062 days ago

> It is the creation of the tool that is covered by fair use.

Copyright doesn’t restrict creation of something, it restricts (mainly) commercial distribution. Research, education and journalism etc are largely unaffected, and would still be.

That said, I believe that selling access to the tool to the public already violates the copyright of the rights holders, even if it doesn’t produce similar works of art. The copyrighted works increased the value of the product (otherwise why would they use it?).

> The latter is something that can be tested because we have processes to compare works of art for it.

This is the most expensive, least practical and most arbitrary part of existing copyright. It would be a huge mistake, imo, to expand this dramatically. This problem mostly goes away if the supply chain is sanely regulated.

All you’d need is give access to the training set upon audit, and bureaucrats could check for copyrighted works. There are already automated tools for this.

bionhoward 1063 days ago

“It’s just a machine!”

So are you!

freejazz 1062 days ago

Don't be obtusely misanthropic

mjan22640 1063 days ago

The value of copyright is going to vanish. There is enough public domain material to train models on and to avoid the problem altogether.

There used to be professions like tinkerers, bards, clowns. The tinkerers disappeared when the society became modern. The clowns on the other hand managed to lobby for laws that put people into jail for heinous crimes like copying pictures, and survived longer. They are going to bite the dust now.

freejazz 1062 days ago

What you describe would result in the opposite - copyright will be incredibly valuable in a system where the vast majority of "creative works" are just regurgitations of past works in the public domain, churned out by machines. In such a world, none of that has a copyright anyway. Actual creative works, which do garner copyright, will then be that much more valuable, because they will continue to be a property right with a breadth of coverage to make them useful.

rcme 1063 days ago

Whether or not “humans do it” isn’t relevant. You can walk around with a copyrighted song in your head. That is not copyright infringement. But if you take that song, create a digital copy, and distribute it for money, then you are violating someone’s copyright. Additionally, our legal system requires a balance of probabilities. It’s hard to prove that someone was influenced by another work unless the similarities are plainly obvious. The same does not apply to ML models where the training data and algorithm are knowable facts.

gremlinsinc 1063 days ago

I challenge you to listen to 4 chords of awesome and tell me again about how every song is completely original. How does Eragon exist when it's definitely ripped parts from Star Wars, etc... AI usually doesn't spit out a full plagiarism, but a loosely inspired work, which is what most media we consume is.

Edit: 4 chords of awesome link is https://youtube.com/watch?v=oOlDewpCfZQ&si=8vL6PbDnHiaffJh3

freejazz 1062 days ago

A copyright in just Eragon would be incredibly thin, for the exact reasons you state. This criticism of copyright by people that have no understanding of actual copyright law, how it works, how it's used, etc, is so exhausting and ignorant.

rcme 1063 days ago

“Every song is completely original” is the opposite of what I said.

distract8901 1063 days ago

The analogy doesn't hold when you consider the sheer scale of the problem.

I can outright buy a machine for a few thousand dollars that can crank out a faithful rewrite of every Stephen King novel without the shitty endings and nonsense plot points. It can do it in a few days, maybe a couple of weeks at most.

To do that with human labor would take years and cost hundreds of thousands, if not millions of dollars.

Instead of paying an artist a couple hundred for a commissioned drawing, I can just scrape up their entire portfolio and generate any image I want with their style. I can generate hundreds or thousands of images. I can take their distinct style and use it exclusively as the branding for my company.

What an ML model does is fundamentally not what happens when a human draws inspiration from prior art. A human would require an extremely significant amount of time and resources to perfectly imitate every artist they have ever seen. It takes a human significant time and resources to produce faithful variations on prior art.

An ML model is measured in words or images per second.

CrimsonRain 1063 days ago

Hello.

Maintaining a system like Netflix or AWS or even Amazon would require an insane number of people and amount of time, if possible at all within a finite time, without all the computers doing work for us in seconds that would take humans ages to do.

omnicognate 1063 days ago

> ... a SWE who is dating a lawyer

> I would argue people do the exact same thing

Perhaps a ménage à trois with a neuroscientist would change your view on this.

ethbr1 1063 days ago

> Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution.

This is the rub. Without reverse attribution... open source anonymous models become a free-for-all loophole.

Since that doesn't currently exist, I think the best we can do is to say that any commercial entity using a model bears the responsibility of proving the model they use is untainted by copyrighted material (to which they haven't secured rights).

Open source model X is... whatever it is.

But I'll be damned if OpenAI / Meta / Microsoft / IBM should be able to build a commercial product on top of laundered copyrighted material while ignoring provenance.

I mean, we have models for this: software code and art. Both aren't clearly attributable. In the case of software code, we've developed case law around clean room design and similarity. In the case of art, we value verifiable chain of custody.

Hopefully, something similar would tilt commercial funding of AI in the direction of responsible use.

Natsu 1063 days ago

My problem with this is that artists learn by studying other artists. Cutting that off because it's AI, rather than focusing on whether the resulting work is derivative, seems more of a problem to me. It seems to me that an AI can be used for either original work or derivatives. Proving that you can get derivatives out of it has always struck me as no different than commissioning a copy of someone's work from a human artist and being shocked that you got what you asked for.

freejazz 1063 days ago

Can an AI express to you how Van Gogh affected it as an artist? I'm not sure that AI is "learning" the way we say humans are "learning" when humans learn and study art. Obviously there is no debate that you can input Van Gogh into a model and produce something Van Gogh-like as a result. But I've not seen anything that indicates that the AI is learning anything about Van Gogh at all. Perhaps it comes down to whether you think learning Van Gogh is just creating a mapping of all of his brush strokes ever, and only exactly what they look like. It's obvious the AI knows nothing more than that. If you think that's what humans do when they learn art, I'd be sad for you!

As to your hypothetical, we don't give copyrights to people who make rote copies of things, human or otherwise. Is the implication of the shock, that there is sufficient difference with the work as to render it a derivative and not a copy? Okay, how so? And of what consequence? Making derivatives of a copyright without license is infringement.

Natsu 1063 days ago

I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

> As to your hypothetical

That's the thing, it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring) and got highly derivative items out of the model and we had a debate over whether that even means anything, because that's both a simple description of the painting and the name of a famous work, so it makes it so it can be ambiguous whether you asked for "Girl with a Pearl Earring" or a girl with a pearl earring in the prompting.

I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.

freejazz 1063 days ago

>I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I don't think that is evidence that what it is doing is "learning".

>I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

Well, it wouldn't be reflective of what the LLM thinks, so what is your point? If you are of the belief that humans don't have thoughts, I guess it's not a surprise you view things this way.

>That's the thing, it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring) and got highly derivative items out of the model and we had a debate over whether that even means anything, because that's both a simple description of the painting and the name of a famous work, so it makes it so it can be ambiguous whether you asked for "Girl with a Pearl Earring" or a girl with a pearl earring in the prompting.

You say derivative but without any reference to what it actually means... what about it is derivative? That's the analysis that's happening in court. The analysis isn't "what you asked the LLM" because that's not dispositive of whether or not something is a copy.

>I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.

Sorry I don't read every single thread about copyright on HN? This is the second posting I've seen on the RFC today. Give me a break!

Natsu 1062 days ago

> I don't think that is evidence that what it is doing is "learning".

When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

> You say derivative but without any reference to what it actually means

We're talking about copyright law, so the meaning of derivative was borrowed from that, i.e. that AI model was producing works that could be reasonably thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring" and this was held up to mean that AIs are just regurgitating training data and are therefore implicitly missing something essential to being an artist or what have you and all their work should be considered derivative works of the training data as far as copyright law is concerned.

Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

> Sorry I don't read every single thread about copyright on HN?

I'm not faulting you for not knowing, I'm faulting myself for assuming too much context and just trying to explain what I had in my head when writing that so you could understand how I came to think that. Hopefully this lets you see where I'm coming from.

freejazz 1062 days ago

>When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I think the dictionary definition is more than sufficient: "the acquisition of knowledge or skills through experience, study, or by being taught." This is what I mean by running with your own made up definition.

>I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

You are the one redefining words like "think" and "experience", not me. I'm not playing that game at all. After all, you are the one equating these processes between humans and AI by coming up with your own, much broader concoctions.

>We're talking about copyright law, so the meaning of derivative was borrowed from that, i.e. that AI model was producing works that could be reasonably thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring" and this was held up to mean that AIs are just regurgitating training data and are therefore implicitly missing something essential to being an artist or what have you and all their work should be considered derivative works of the training data as far as copyright law is concerned.

I'm familiar with copyright law, I'm not sure you are. A work can be derivative in a number of ways, some are legal, some aren't. It's not a new thing that some uses by a machine can be infringing, and others, non-infringing. Why now must it be that machines should be analyzed the same as humans all of a sudden?

>Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

Yes, I understand that. But I asked why it should be judged as a human, and you are saying because it "learns". But that's only based upon your re-defining the concept of learning in order to make it inhuman. The only reasonable arguments I've seen that AI outputs should be copyrightable are based on them being a tool that an artist can use. What you are saying is just dressed up anthropomorphization.

skydhash 1063 days ago

You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

Most LLMs are just profiteering from people’s labor without their consent. And there’s nothing new being produced. It’s always a statistical output of previous works.

gwd 1063 days ago

> You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

The same would automatically apply to LLM output -- there's no need to change the current laws to cover that case.

The question is this. Suppose I ask a human artist and an LLM to create me a new female mouse cartoon character. And suppose both the artist and the LLM have been exposed to Minnie Mouse. It's not unlikely that the new character created in both cases will have aspects specifically similar to, or specifically opposite to Minnie Mouse.

In the case of the human artist, the new character will not be covered by Disney's copyright, unless there was a lot of copying. Why should the result be different for LLMs?

The logical conclusion of "any output of an LLM that's seen Minnie Mouse must be subject to Disney's copyright" is "any output of any human that's seen Minnie Mouse must be owned by Disney". Which I'm sure Disney would love, but would certainly make the world a worse place for everyone.

chii 1063 days ago

> a pin-up version of Minnie Mouse

That's not because of copyright, but because of trademark. If you make the character sufficiently different that the average person cannot mistake it for Minnie, and don't call it Minnie Mouse (to get rid of the trademark issue), Disney will have a much harder time suing you. Of course, they will still try, and steamroll you with just money instead.

JamesBarney 1063 days ago

> And there’s nothing new being produced. It’s always a statistical output of previous works.

I don't think you can define those terms such that what you say is true of AI but not true of people.

Natsu 1063 days ago

I think you're misunderstanding that, I don't expect it in either case, I'm saying you have to judge the output not the input. So even if it trained on a ton of copyrighted artwork, if the output isn't a ripoff of something in the training data, I don't think there should be any copyright issues.

idle_zealot 1063 days ago

Is intelligence really a factor here?

Say I use the same training set as one of these LLMs, copyright protected text and all, and use it to derive a compression algorithm that uses very little space to store tokens and token sequences that are common in that huge collection of text. The resulting compression scheme includes some sort of statistical artifact derived from that copyrighted text. Is that allowed? And if so why is an LLM different?
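
The hypothetical above can be made concrete with a rough sketch. This assumes zlib's preset-dictionary feature as a stand-in for "a compression scheme derived from the text"; the tiny `corpus`, the `preset` slice, and the helper names are all made up for illustration and are not anything a real LLM pipeline does.

```python
import zlib

# Hypothetical stand-in for a large training corpus.
corpus = b"the quick brown fox jumps over the lazy dog. " * 20

# The "statistical artifact": a preset dictionary derived from the corpus.
# Taking a slice is a simplification; a real scheme would select common substrings.
preset = corpus[:200]

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    # Only pass zdict when one is supplied; otherwise use plain deflate.
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# A new message that reuses phrases from the corpus in a different order.
message = b"the lazy dog jumps over the quick brown fox."
plain = compress(message)
primed = compress(message, preset)

print(len(plain), len(primed))                # the primed output is shorter
print(decompress(primed, preset) == message)  # but only decodable with the dictionary
```

The dictionary here literally contains corpus text, which is the uncomfortable part of the question: the artifact is useless without the source material it was cut from.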

cj 1063 days ago

Very good question indeed.

A lot of these questions are somewhat ethical/moral in nature. E.g. is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm? I don't know.

It's awesome to see the Copyright office request input from both sides of the argument.

livrem 1063 days ago

It worries me that so much focus is on two sides that may not have the end-users' best interest much in mind. The companies building the models may have an incentive to regulate models to keep smaller players or open source projects away. Artists mostly seem totally anti any solutions as even laws that allow models trained on purely public domain art would be bad for them. If laws around this are shaped primarily by the wishes of those two groups I am not sure things will end up well at all for those of us that want the tools to keep improving and remain reasonably free (including applications you can install locally and run on your own GPU).

chii 1063 days ago

> is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm?

The test I use is: if a human is currently allowed to perform this same task, then it is allowed to be done using an AI model.

quickthrower2 1063 days ago

LLMs are generative though not just compressive

orbital-decay 1063 days ago

Generation, prediction, and compression are all the same - the only different thing is the intent.

stale2002 1063 days ago

> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

None of what you are saying has anything to do with copyright.

The tool Photoshop isn't generally intelligent either. And yet, yes it can be used to create art using other people's stuff.

And it could be done legally if the results are transformative.

jtr1 1063 days ago

Photoshop doesn’t install with a massive directory of other people’s copyrighted works to draw snippets from.

tick_tock_tick 1063 days ago

Yes it does...

devsda 1063 days ago

If it does, then Adobe would have commissioned or acquired the license. In either case they would have _paid_ someone to get those images.

It is very unlikely Adobe would be shipping their software with copyrighted material without paying for them first.

fluidcruft 1063 days ago

I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression". Copyright and "lossy compression" are pretty easy to reason about. Model "building" is "compression". Model "use" is "decompression". Everything about these AI models seems to be about the "lossy" part, but "lossy" is just an adjective to the main show.

It's very difficult to not conclude that copyright of a trained model should be treated identically to the copyright of a zip file.

chii 1063 days ago

Information is not copyrighted, just the expression of said information.

So if you took a recipe book, extracted the recipe information, and listed out the recipe in a different format (such as a table), it's a new work. It does not violate the copyright of the recipe book you extracted the info from.

gwd 1063 days ago

> I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression".

If you feed a photo of your dog into a JPEG compressor and the result looked like a cat in the same style, I think you'd be pretty annoyed.

CamperBob2 1063 days ago

When you perform lossy compression, you feed it one file at a time, not every file in existence.

fluidcruft 1063 days ago

If you concatenate images into a stream container (say as tar) and then compress the stream, the compression coding will (generally) cross over the individual images. True, that's generally not lossy compression.

But concatenating images is also how you create video. Lossy video compression does typically cross over frames. So I don't actually see a difference. If you want to think about mkv or mp4 instead of zip it's still the same concept.

There's nothing stopping you from putting every available image into a video and figuring out how to compress it lossily.

Maybe there are some bounds for how much information was lost? Obviously piping everything into /dev/null destroys the input. And piping /dev/random from a true random source creates information. So somewhere between that and lossless compression there's the nebulous "plagiarism" threshold. And then there's another threshold that is copyright infringement that's considered "fair use".

But the general structure of the "AI" this is about is fundamentally storage and retrieval.
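
A small sketch of the "compression crosses over the individual images" point, using lossless zlib rather than a lossy video codec. The two byte strings are hypothetical stand-ins for images that share most of their content.

```python
import random
import zlib

rng = random.Random(0)

# Hypothetical "images": two byte strings sharing most of their content.
shared = bytes(rng.randrange(256) for _ in range(2000))
image_a = shared + b"A" * 16
image_b = shared + b"B" * 16

# Compress each item on its own, then both as one concatenated stream.
separate = len(zlib.compress(image_a)) + len(zlib.compress(image_b))
solid = len(zlib.compress(image_a + image_b))  # one stream, like tar + gzip

print(separate, solid)  # the single stream is smaller
```

In the single stream, the second item is mostly encoded as references back into the first, so neither compressed item stands alone.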

freejazz 1063 days ago

What does any of this have to do with creating a new expression?

fluidcruft 1063 days ago

What makes anything new? Is anything created by "AI" actually new? How much entropy is in a prompt vs in the output?

freejazz 1063 days ago

>What makes anything new?

In copyright law? It's not being a copy

tomrod 1063 days ago

Some compression, yes, but the analogy oversimplifies. AI re-represents input information in a transformative way (embedding, say), then creates new, derived and combined output from a new input (e.g. a prompt).

It's not just lossy compression. It's potentially novel.

fluidcruft 1063 days ago

Phrases like "transformative way" are meaningless woospeak to me. Everything is a transformation. Suppose I run a linear convolution on ten images and average them. Is the result "new"? Does it not contain the original images? Subspaces and mappings don't create anything "new" any more than SVD does. This is just playing digital Ship of Theseus.

tomrod 1063 days ago

> Phrases like "transformative way" are meaningless woospeak to me

Fortunately we live in a society that supports specialization where something that is woospeak to a smart person can still be a very well understood topic. AI transformations are methodologically well documented, even if transparency of neural network node activations is yet to be fully formalized.

fluidcruft 1063 days ago

In that case, you'll surely be able to provide a citation that clearly distinguishes the differences between the ways of transformations performed by "AI" and the ways of transformations performed by compression.

tomrod 1060 days ago

Sure. AI (more specifically, ML) is curve fitting, and more generally, objective function optimization. https://en.m.wikipedia.org/wiki/Curve_fitting

A projection is not necessarily compression. And you'll find AI is a very poor compressor when used for such a purpose in all but the most trivial setups (e.g. SVD matching input data rank, only reversible functions in neural network activations, etc.).

KHRZ 1063 days ago

Congratulations, you just discovered that copyright is a weak and ill-defined concept.

fluidcruft 1063 days ago

I think that unless you can clearly show that an "AI" is not a form of compression, the question of copyright is orthogonal. The copyrights that apply to a zip file may be ill-defined concepts to you, but that's not really important to the core question, which is: how are model weights different from a zip file? If you put unambiguously copyrighted content into a zip file, most people would agree that the copyright applies to the zip file. So by analogy, if you put copyrighted content into model weights, the copyright applies to the model weights. Issues such as what constitutes fair use come up, but fair use is permissible copyright infringement, not absence of copyright. And that's where the question arises of how lossy a compression algorithm has to be for its output to be considered "fair use". In all likelihood it's the specifics of the use itself (rather than the technology or method used) that matter.

skydhash 1063 days ago

It’s compression + filtering. Nothing generative. Its output is like 99.99 % deterministic.

tomrod 1058 days ago

Linear regression is 100% deterministic after training and isn't lossless compression, but rather a linear projection along a manifold in a (potentially transformed) input space.

So, maybe not just compression+filtering, if level of deterministic behavior is to be the gauge.
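
To illustrate the linear regression point, here is a minimal sketch with made-up data: the fit is fully deterministic, yet the two fitted parameters cannot reproduce the four training values, so it is not lossless storage of them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Illustrative training data (not from any real dataset).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 1.0, 3.0]

a, b = fit_line(xs, ys)
predictions = [a + b * x for x in xs]

print(fit_line(xs, ys) == (a, b))  # same inputs, same parameters every time
print(predictions)                 # close to ys, but not equal to ys
```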

Philpax 1063 days ago

Source?

8note 1063 days ago

Why is being a statistical model relevant?

The simplest statistical model is an average. Why would the average pixel rgba of a bunch of images invoke the copyright of those images?
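
As a sketch of why an average says so little about its inputs: two entirely different sets of made-up "images" below produce the same average pixel, so the average alone cannot tell you which one it was computed from.

```python
def average_pixel(images):
    """Mean RGBA value over every pixel of every image."""
    pixels = [p for img in images for p in img]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(4))

# Each "image" is a list of RGBA pixels; the values are invented for illustration.
set_one = [[(0, 0, 0, 255), (255, 255, 255, 255)]]        # one black, one white pixel
set_two = [[(127, 127, 127, 255), (128, 128, 128, 255)]]  # two mid-grey pixels

print(average_pixel(set_one))
print(average_pixel(set_two))
```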

chii 1063 days ago

The crux of the AI copyright argument sits in economics. Those currently producing content want future content generated from AI to benefit them financially, as long as a thin sliver of their own content was used in the training.

This is like asking all students to pay their teachers a (small) percentage of their future economic output.

JamesBarney 1063 days ago

My opinion is we should treat AI like Photoshop/Word/Windows. If you use Windows to copy a file and distribute it, Microsoft isn't liable; you are. If you use Word to type up a book and sell it, you're responsible.

Same with a statistical model: if you generate a copyrighted work and distribute it, you are responsible. But the tool (GPT-4) maker isn't responsible, just like Adobe isn't responsible for copyright infringement.

The copyrighted text/image isn't generated until you ask it to. Your prompt is what reproduces the material.

NoMoreNicksLeft 1063 days ago

Why would any non-lunatic want to live in a world where someone can't import an image into software?

If only some software is disallowed, then why permit Excel but prohibit Stable Diffusion?

Can someone even look at a SD-generated image, and claim with certainty that their own art was used to train it? Any more than claiming that another artist was inspired by it, looking at their output?

I'm fine with anything goes. The alternative seems to be copyright maximalist clownworld.

paxys 1063 days ago

> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

But then you are just shifting the problem forward by an inch. What happens when tomorrow someone declares that their model is generally intelligent and is therefore allowed to disregard copyright when training just like a person can?

jasonzemos 1063 days ago

This point is of the utmost importance from a public policymaking perspective. Laws such as these are easy to craft now and difficult to change later. I feel like we are previewing an unfolding disaster here.

The future will clearly yield a class of "beings" striving for some degree of indistinguishability from or coexistence with humans. Proposals that discriminate --literally discriminate -- without respect for the principles of universality and equal treatment under law are creating and condemning a marginalized group before it even reaches maturity. This is an old and tired theme repeated through history. Let's foresee this and not get it wrong.

freejazz 1063 days ago

Is it your experience that people's facial declarations carry the day in legal disputes? It's not mine. Rather, it seems like the whole thing is designed to provide scrutiny against bare facial declarations that something is true or false.

I see this on HN all the time "someone just has to claim" "someone just has to say". Yeah... that's not how it works. People can say whatever they want, that doesn't mean it satisfied their burden of proof. Self serving testimony is the lowest form of evidence imaginable.

orbital-decay 1063 days ago

Intelligence lacks any legal definition, for starters. And if a law like that will provide an arbitrary line in the sand, it will just disincentivize AI research in general.

freejazz 1063 days ago

Often, when laws are passed, they provide definitions for the terms in the law that require definitions. Regardless, I'm not aware of any proposals for copyright law where "intelligence" is used.

paulusthe 1063 days ago

I agree completely. AI model trainers should have to pay the people who provide their training materials, and there should be a default assumption of opting out until someone or their company explicitly opts in.

Unfortunately the Peter Thiels and all those bizarrely out-of-touch Silicon Valley assholes have already effectively scraped the Internet, because ethics don't matter if you're special like them, so to a degree regulations are way behind the curve.

That said it's still worth doing, and I'd love to see it done retroactively as well. It's not as if "I forgot that I had a public Myspace 25 years ago" is an implicit user opt-in for some startup to save your data - however anonymized they claim it is (lol!) - and train its AI on it.

zmmmmm 1063 days ago

> The alternative seems to be “anything goes”.

Seems like a huge false dichotomy. You really can't imagine anything in between total shutdown of AI training on public data sources and no rules at all?

I think we should try a bit harder for a middle ground.

lewhoo 1063 days ago

I think you are right. People argue about whether LLMs store or generalize. I propose an experiment for anyone interested. Try this prompt multiple times, changing the verse numbers as appropriate:

> Provide quote from King James' Bible Genesis :25-31

or

> Provide quote from King James' Bible Genesis :1-25

or whatever you fancy.

I didn't go through the whole Bible, but I got pretty much a verbatim chapter. I argue that you can't do this with copyrighted books only because of guardrails, not because ChatGPT lacks the capability, so the information is there, and it's verbatim. Plus, other books don't have such nifty indexing.

mensetmanusman 1063 days ago

Because the cat is out of the bag, so to speak, any attempt to force AI companies to generate their own content to train on means we are signing up for a future where only multibillion-dollar companies are in control.

PaulDavisThe1st 1063 days ago

If they were truly forced to do this, even they would find it difficult.

CamperBob2 1063 days ago

And everyone else would find it impossible.

Hence the headlong rush to implement regulatory capture.

gnopgnip 1063 days ago

Is there any precedent where copyright was focused on the input rather than the final published work?

jj999 1063 days ago

Compilers

lstodd 1063 days ago

Object code is a derivative work I think.

So no. Compilers do not count.

kevinmchugh 1063 days ago

The US had to update copyright law to explicitly protect binaries

freejazz 1063 days ago

That just means some judges got it wrong and congress really wanted to make sure others didn't. I'm not sure what proposition that stands for here, except that sometimes new things are hard to get right at first.

kevinmchugh 1063 days ago

Remixes, generally?

harshreality 1063 days ago

This is more of a problem for images, where similar output to inputs is likely, than for LLMs, where no matter what you prompt it with I doubt you can get it to regurgitate any significant parts of Harry Potter well enough to be a classical copyright violation of any of the novels. Maybe you could generate a copyright violation of character traits.

The output space of images (MB for larger images) tends to be larger than books (a few hundred KB of text for a long novel), but the perceptual output space of books is much larger.

Any determination that licensing is required for AI generation, or use of AI-generated works, is unacceptable until Congress or courts put some reasonable objective tests in place to determine what is and isn't a copyright violation for various types of works of various lengths. Not the ambiguous 4-factor test that is basically whatever the judge feels like. It will be a complete mess otherwise. They can't just define a new AI policy for copyright with a few types of works in mind; it has to work for all works.

You could look at this mathematically from a complexity perspective and try to define a similarity function that's true when a second work is close enough to a first work to be a derived work (assuming the first one had been seen by the creator of the second). Unfortunately that won't work because nobody can define such a function to everyone's satisfaction, and the courts wouldn't accept any informal suggestion of a definition when it didn't come from Congress. Specifically, you'd get into trouble with consistency in the function determining derived works depending on length of the work: short works, like a haiku, are much more sensitive to copyright violation in some ways... a mere 17 syllables is a complete reproduction and therefore a copyright violation, yet a single word isn't; for a novel, reproducing 1/17 of the content is almost certainly a copyright violation, but reproducing 17 syllables probably isn't.

Different stakeholders and creative re-mixers would want different things from the function. It's untenable.
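
The length-sensitivity problem described above can be sketched with a deliberately naive similarity function. This is an assumption-heavy toy, not a proposed legal test: `copied_fraction` and the placeholder texts are hypothetical, and "longest shared run of words" is just one of many possible measures.

```python
from difflib import SequenceMatcher

def longest_shared_run(original: str, candidate: str) -> int:
    """Length, in words, of the longest word sequence the two texts share."""
    a, b = original.split(), candidate.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def copied_fraction(original: str, candidate: str) -> float:
    """Longest shared run as a fraction of the original work's length."""
    return longest_shared_run(original, candidate) / len(original.split())

# Placeholder texts: an 8-word "short work" and an 800-word "long work" containing it.
short_work = "one two three four five six seven eight"
long_work = short_work + " " + " ".join(f"w{i}" for i in range(792))
excerpt = short_work  # the same 8 words, reused elsewhere

print(copied_fraction(short_work, excerpt))  # 1.0: the entire short work
print(copied_fraction(long_work, excerpt))   # 0.01: a sliver of the long work
```

The same eight words are a complete reproduction of one work and one percent of the other, so any fixed threshold, absolute or relative, misjudges one of the two cases.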

judge2020 1063 days ago

> This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation

That is a fairly illogical leap. From your text alone, “should not be allowed to disregard the copyright of its source material” would be: “the AI’s maintainer should have a fairly reliable (but not infallible) system to output how likely it generated something that is a direct derivative work of something in its dataset”. As a human you don’t need to attribute/license every piece of art you’ve seen of clouds if you draw a cloud. So if an AI draws a cloud that is actually derivative of the millions of clouds it has seen, then it doesn’t need any permission from the millions of creators to draw one either.

rmbyrro 1063 days ago

AI is taking work away from lawyers, and instantly creating more work for lawyers.

Ain't that interesting to reflect upon?

I speculate there is a hidden force in the universe, something physicists are yet to identify, which mandates: "they shall always have something to do".

mjan22640 1063 days ago

The human brain is no different. It generates content from the things it learned.

CatWChainsaw 1061 days ago

Repost #4 I believe

https://news.ycombinator.com/item?id=37305580

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."

gaganyaan 1063 days ago

I hope your opinion isn't shared by lawmakers. Copyright is a relic of the past, and it needs to be put out of its misery. Trying to (mis)apply copyright here would just lobotomize the US. Existing companies would just technically operate out of a saner jurisdiction, and we'd be handing other countries a golden opportunity to leapfrog the US.

scotty79 1063 days ago

"anything goes" is the best and most natural solution. Just don't let people copyright the output if they don't have full copyright on all of the inputs. This should finally get rid of the cancer that is copyright in a generation or two.

rickmode 1063 days ago

Generic reply to siblings here… I get the intelligence argument.

My _main_ point is that there’s a non-trivial question to answer here.

I’m not qualified to answer (though I’ve offered up my non-expert opinion). It certainly seems to quickly veer into philosophy!

jillesvangurp 1063 days ago

It shows you are not a lawyer. You misunderstand how copyright works. Creating copies or derivative works and distributing those is all that matters under copyright. This is not "disregarding" copyright (which is not an actual thing) but something that is either fair use or may require some kind of permission from the creators of the original by those distributing some kind of derived work or copy. That's why it's called copyright.

Copyright merely restricts the distribution of original works or their derivatives. In case of an infringement, copyright holders can insist you stop distribution and/or compensate them for that.

If I sell you a paint brush, I'm not liable for you putting a red nose on the Mona Lisa and trying to sell it off as an original work. Doing that on the original would be an act of vandalism (because you don't own it) and doing that on a replica that you got from somewhere infringes on the rights of those that created the replica. Which is a derived work or copy in itself of course and the distribution of that is regulated by copyright. Distribution of such a replica is of course fine because Da Vinci has been dead for a very long time and his work would no longer be protected under copyright. Distributing your red nosed Mona Lisa would therefore be fine too. Either way, the paint brush seller is no party in this case; this is between you, Da Vinci, his descendants, and the replica creators.

Now your assertions as to what AIs are or aren't are simply not relevant. You assert it's a statistics algorithms thingy. That sounds like a tool to me. Yet another paint brush. Using a paint brush is not infringing on anyone's rights. For that you have to distribute the results of your work. The nature of the tool does not matter. How you use the tool does not matter either. You merely create (potentially) derivative works with the tool and what you do with those matters. Especially when you distribute them to others. One of those derivative works is of course the AI model itself. Creating one is fine. Copyright gets potentially infringed when you distribute one.

Now we get to the core of the matter. Can you with a straight face say the AI model resembles the original and is a derivative work? It doesn't actually look like or resemble the original in any shape or form. Even proving the AI model is derived from the original is tricky. Copyright is not about protecting vague ideas or notions but the concrete shape or form of things. And it's only an infringement if you distribute a derived work or a copy of a thing to others. So, merely creating an AI model is not distributing anything to anyone. You are merely using tools to create something for yourself. An AI model in this case.

Distributing a verbatim copy of a book is an infringement. Citing the book in your own work is fair use (up to a point). Paraphrasing elements from the book, acknowledging it exists, taking inspiration from it, or reading it aren't copyright infringements.

The legal problem with AI models is that their concrete shape or form doesn't resemble the original inputs in any shape or form. Besides, companies like OpenAI don't actually distribute their AI models. They are huge; it's not very practical. They merely exploit those models to generate outputs to inputs from their users and customers. Are those outputs derivative works? Maybe, but that's where it gets tricky. They clearly aren't in the classical sense. Not even close. But if you somehow could conclude that they are, who is distributing that derivative work? Secondly, if the AI model is a tool, who actually creates those outputs and are those outputs protected under copyright? Who actually holds those rights? And how would you tell apart such an output from a human created one?

It's questions like this that make all this extremely murky from a legal point of view. IMHO without dramatic changes to copyright law or the way it has been commonly interpreted legally, it's just very poorly suited to do anything about stopping AI companies from doing what they are doing. You'd have to bend the conventional interpretation quite a bit for that. No doubt, there will be court cases where people will try to do that. But it will take many years before the dust settles on that. And I wouldn't get my hopes up on some unexpected/dramatic outcome.

freejazz 1063 days ago

This is generally right, but I'm surprised you aren't aware that distribution isn't the only right protected by copyright - creating derivative works is protected, display rights are protected.