by mattxxx 403 days ago
Well, firing someone for this is super weird. It seems like an attempt to censor an interpretation of the law that:

1. Criticizes a highly useful technology
2. Matches a potentially-outdated, strict interpretation of copyright law

My opinion: I think using copyrighted data to train models for sure seems classically illegal. Despite that, Humans can read a book, get inspiration, and write a new book and not be litigated against. When I look at the litany of derivative fantasy novels, it's obvious they're not all fully independent works.

Since AI is and will continue to be so useful and transformative, I think we just need to acknowledge that our laws did not accommodate this use-case, and then we should change them.


> Humans can read a book, get inspiration, and write a new book and not be litigated against

Humans get litigated against for this all the time. There is such a thing as, charitably, being too inspired.

https://en.wikipedia.org/wiki/List_of_songs_subject_to_plagi...

If you follow these cases more closely over time you'll find that they're less an example of humans stealing work from others and more an example of typical human greed and pride. Old, well established musicians arguing that younger musicians stole from them for using a chord progression used in dozens of songs before their own original, or a melody on the pentatonic scale that sounds like many melodies on the pentatonic scale do. It gets ridiculous.

Plus, all art is derivative in some sense, it's almost always just a matter of degree.

> art is derivative in some sense, it's almost always just a matter of degree.

Yes, that's why we judge on a case by case basis. The line is blurry.

I think when you're storing copies of such assets in your database that you're well past the line, though.

To the point that Billy Joel "famously" credited the songwriter for one of his songs ("This Night") as "Billy Joel, Ludwig van Beethoven".
The law covers these cases pretty well, it is just that the law has very powerful extremely rich adversaries, whose greed has gotten the better of them again and again. They could use work released sufficiently long ago to be legally available, or they could take work released as creative commons, or they could run a lookup, to make sure to never output verbatim copies of input or outputs, that are within a certain string editing distance, depending on output length, or they could have paid people to reach out to all the people, whose work they are infringing upon. But they didn't do any of that, of course, because they think they are above the law.
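For what it's worth, the "lookup within a certain string editing distance, depending on output length" idea can be sketched in a few lines. This is only an illustration, not how any vendor actually does it: the function names, the `max_ratio` parameter, and the 10% default are all made up here, and a real system would need an index rather than a linear scan over every protected passage.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def too_close(output: str, protected: list[str], max_ratio: float = 0.1) -> bool:
    """True if `output` is within a length-dependent edit budget of any protected passage.

    `max_ratio` is a hypothetical tuning knob: the allowed number of edits
    grows with the length of the output, as the comment above suggests.
    """
    budget = int(len(output) * max_ratio)
    return any(edit_distance(output, p) <= budget for p in protected)
```

A caller would run `too_close(candidate, protected_passages)` before serving a response and regenerate if it returns `True`.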
I'm confused, so you're saying it's illegal? Because last I checked it's still in the process of going through the courts. And let's not forget that copyright's purpose is to advance the arts and sciences. Fair use is codified into law, which states each case is seen on a use by use basis, hence the litigation to determine if it is, in fact, legal.
It’s so fucking obviously illegal when you think about it rationally for more than a few seconds. We aren’t even talking about “fair use” we are talking about how it works in practice which was Meta torrenting pirated books, never paying anyone a cent and straight up stealing the content at scale.
The fact you are even using the word stealing, is telling to your lack of knowledge in this field. Copyright infringement is not stealing[0]. The propaganda of the copyright cartel has gotten to you.

[0] https://en.wikipedia.org/wiki/Dowling_v._United_States_(1985...

> Copyright infringement is not stealing

If we can agree that taking away of your time is theft (wage theft, to be precise), we as those who rely on intellect in our careers should be able to agree that the taking of our ideas is also theft.

>moved to the Ninth Circuit Court of Appeals, where he argued that the goods he was distributing were not "stolen, converted or taken by fraud", according to the language of 18 U.S.C. 2314 - the interstate transportation statute under which he was convicted. The court disagreed, affirming the original decision and upholding the conviction. Dowling then took the case to the Supreme Court, which sided with his argument and reversed the convictions.

This just tells me that the definition is highly contentious. Having the Supreme Court reverse a federal ruling already shows misalignment.

I still feel like the point is useless, because at the end of the day, if some normal person went ahead and did the same thing the tech giant did, they would long be moved to a less comfortable new home, that has high security against breaking in. At the end of the day, the situation now is, that some are more equal than others, and it is unacceptable, yet, due to the mountains of (also unethically acquired) cash they have, they can get away with something a normal person cannot. Even the law might be bent to their will, because if suing them fails, it creates precedent.

If we end up saying it is not illegal, then I demand, that it will not be illegal for everyone. No double standards please. Let us all launder copyrighted material this way, labeling it "AI".

> The fact you are even using the word stealing, is telling to your lack of knowledge in this field.

I agree. If you can pay the judge, the congress or the president, it is definitely not stealing. It is (the best) democracy (money can buy). /s

So when someone steals something from you, you no longer have it. Yet here they paid the judge(s) because the person who's been "robbed" still has their thing?
A test to apply here: If you or I did this, would it be illegal? Would we even be having this conversation?

The law is supposed to be impartial. So if the answer is different, then it's not really a law problem we're talking about.

Obviously a revenue tracking weight should be trained in allowing the tracking and collection of all values generated from derivative works.
Humans are not allowed to do what AI firms want to do. That was one of the copyright office arguments: a student can't just walk into a library and say "I want a copy of all your books, because I need them for learning".

Humans are also very useful and transformative.

Or we could acknowledge that something could be a bad idea, despite its utility
> Despite that, Humans can read a book, get inspiration, and write a new book and not be litigated against.

You're still not gonna be allowed to commercially publish "Hairy Plotter and the Philosophizer's Rock".

No, but you are most likely allowed to commercially publish "Hairy Potter and the Philosophizer's Rock", a story about a prehistoric community. The hero is literally a hairy potter who steals a rock from a lazy deadbeat dude who is pestering the rest of the group with his weird ideas.
Not sure what you are getting at?
You are if it's parody, cf 'Bored of the Rings'.
Assuming this means copyright is dead, companies will be very upset and patents will likely follow.

The hold US companies have on the world will be dead too.

I also suspect that media piracy will be labelled as the only reason we need copyright, an existing agency will be bolstered to address this concern and then twisted into a censorship bureau.

Then they need to be changed for everyone and not just AI companies, but we all know that ain't happening.
The problem with this kind of analysis is that it doesn't even try to address the reasons why copyright exists in the first place. This belief that training LLMs on content without permission should be allowed is incompatible with the belief that copyright is useful, you really have to pick a lane here.

Go back to the roots of copyright and the answers should be obvious. According to the US constitution, copyright exists "To promote the Progress of Science and useful Arts" and according to the EU, "Copyright ensures that authors, composers, artists, film makers and other creators receive recognition, payment and protection for their works. It rewards creativity and stimulates investment in the creative sector."

If I publish a book and tech companies are allowed to copy it, use it for "training", and later regurgitate the knowledge contained within to their customers then those people have no reason to buy my book. It is a market substitute even though it might not be considered such under our current copyright law. If that is allowed to happen then investment will stop and these books simply won't get written anymore.

It's funny how a law becomes potentially-outdated only when big corporations want to violate it on a global scale.

As a private person I no longer feel incentivised to create new content online because I think that all I create will eventually be stolen from me...

> Piracy refers to the illegal act of copying, distributing, or using copyrighted material without authorization. It can occur in various forms

Processing of IP without a license AND offering it as a model for money doesn't seem like an unknown use-case to me

>My opinion: I think using copyrighted data to train models for sure seems classically illegal. Despite that, Humans can read a book, get inspiration, and write a new book and not be litigated against. When I look at the litany of derivative fantasy novels, it's obvious they're not all fully independent works.

Huh? If you agree that "learning from copyrighted works to make new ones" has traditionally not been considered infringement, then can you elaborate on why you think it fundamentally changes when you do it with bots? That would, if anything, seem to be a reversal of classic copyright jurisprudence. Up until 2022, pretty much everyone agreed that "learning from copyrighted works to make new ones" is exactly how it's supposed to work, and would be horrified at the idea of having to separately license that.

Sure, some fundamental dynamic might change when you do it with bots, but you need to make that case in an enforceable, operationalized way.

Sorry but AI isn't that useful and I don't see it becoming any more useful in the near term. It's taken since ~1950 to get LLMs working well enough to become popular and they still don't work well.
Pirating movies is also useful, because I can watch movies without paying on devices that apps and accounts don't work on.

That doesn't make piracy legal, even though I get a lot of use out of it.

Also, a person isn't a computer so the "but I can read a book and get inspired" argument is complete nonsense.

It's only complete non-sense if you understand how humans learn. Which we don't.

What we do know though is that LLMs, similar to humans, do not directly copy information into their "storage". LLMs, like humans, are pretty lossy with their recall.

Compare this to something like a search indexed database, where the recall of information given to it is perfect.

Well, you don't get to pick and choose in which situations an LLM is considered similar to a human being and in which not. If you argue that it is lossy in the same way a human is, well let's go ahead and get most output checked by organizations and courts for violations of the law and licenses, just like human work is. Oh wait, I forgot, LLMs are run by companies with too much cash to successfully sue them. I guess we just have to live with it then, what a pity.
There are a couple of ways to theoretically prevent copyright violations in output. For closed models that aren't distributed as weights, companies could index perceptual hashes of all the training data at a granular level (like individual paragraphs of text) and check/retry output so that no duplicates or near-duplicates of copyrighted training data ever get served as a response to end users.
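A rough sketch of that check/retry loop, to make it concrete. Note the substitution: instead of true perceptual hashes, this uses hashed word 5-grams ("shingles") as a stand-in, which is a common way to catch near-duplicate text. Every name here (`NearDuplicateIndex`, `serve`, the 0.5 threshold) is hypothetical, and a production system would need something far more scalable than an in-memory set.

```python
import hashlib
import re


def shingles(text: str, n: int = 5) -> set[str]:
    """Hash every run of n consecutive normalized words in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        hashlib.sha1(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 0))
    }


class NearDuplicateIndex:
    """In-memory index of shingle hashes taken from training paragraphs."""

    def __init__(self, n: int = 5):
        self.n = n
        self.seen: set[str] = set()

    def add(self, paragraph: str) -> None:
        self.seen |= shingles(paragraph, self.n)

    def overlap(self, candidate: str) -> float:
        """Fraction of the candidate's shingles that also occur in the index."""
        cand = shingles(candidate, self.n)
        if not cand:
            return 0.0
        return len(cand & self.seen) / len(cand)


def serve(generate, index: NearDuplicateIndex, threshold: float = 0.5, max_tries: int = 3):
    """Call `generate()` until its output overlaps the index less than `threshold`.

    Returns None if every attempt looked like a near-duplicate.
    """
    for _ in range(max_tries):
        text = generate()
        if index.overlap(text) < threshold:
            return text
    return None
```

Here `generate` would be whatever function produces a model response. The threshold trades false positives (blocking common phrases) against false negatives (letting lightly edited excerpts through), so it would have to be tuned per corpus.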

Another way would be to train an internal model directly on published works, use that model to generate a corpus of sanitary rewritten/reformatted data about the works still under copyright, then use the sanitized corpus to train a final model. For example, the sanitized corpus might describe the Harry Potter books in minute detail but not contain a single sentence taken from the originals. Models trained that way wouldn't be able to reproduce excerpts from Harry Potter books even if the models were distributed as open weights.

YouTube built probably the most complex and proactive copyright system any organization has ever seen, for the sole purpose of appeasing copyright holders. There is no reason to believe they won't do the same thing for LLM output.
And everyone here is downloading every show and movie in existence without even a hint of guilt.
Why would you feel guilt over using an unlimited resource? You're not stealing.
>Despite that, Humans can read a book, get inspiration, and write a new book and not be litigated against.

Corporations are not humans. (It's ridiculous that they have some legal protections in the US like humans, but that's a different issue). AI is also not human. AI is also not a chipmunk.

Why the comparison?

Doing a cover song requires permission, and doing it without that permission can be illegal. Being inspired by a song to write your own is very legal.

AI is fine as long as the work it generates is substantially new and transformative. If it breaks and starts spitting out other people's work verbatim (or nearly verbatim) there is a problem.

Yes, I'm aware that machines aren't people and can't be "inspired", but if the functional results are the same the law should be the same. Vaguely defined ideas like your soul or "inspiration" aren't real. The output is real, measurable, and quantifiable and that's how it should be judged.

I fear the lack of our ability to measure your mind might render you without many of the legal or moral protections you imagine you have. But go ahead, tear down the law to whatever inanity can be described by the trivial machines of the world's current popular charlatans. Presumably you weren't using society's presumption of your agency anyway.
> I fear the lack of our ability to measure your mind might render you without many of the legal or moral protections you imagine you have.

Society doesn't need to measure my mind, they need to measure the output of it. If I behave like a conscious being, I am a conscious being. Alternatively you might phrase it such that "Anything that claims to be conscious must be assumed to be conscious."

It's the only answer to the p-zombie problem that makes sense. None of this is new, philosophers have been debating it for ages. See: https://en.wikipedia.org/wiki/Philosophical_zombie

However, for copyright purposes we can make it even simpler. If the work is new, it's not covered by the original copyright. If it is substantially the same, it is. Forget the arguments about the ghost in the machine and the philosophical mumbo-jumbo. It's the output that matters.

In your case, it isn't the output that matters. Your saying "I'm conscious" isn't why we attribute consciousness to you. We would do so regardless of your ability to verbalise anything in particular.

Your radical behaviourism seems an advantage to you when you want to delete one disfavoured part of copyright law, but I assure you, it isn't in your interest. It doesn't universalise well at all. You do not want to be defined by how you happen to verbalise anything, unmoored from your intention, goals, and so on.

The law, and society, imparts much to you that is never measured and much that is unmeasurable. What can be measured is, at least, extremely ambiguous with respect to those mental states which are being attributed. Because we do not attribute mental states by what people say -- this plays very little role (consider what a mess this would make of watching movies). And none of course in the large number of animals which share relevant mental states.

Nothing of relevance is measured by an LLM's output. It is highly unambiguous: the LLM has no mental states, and thus is irrelevant to the law, morality, society and everything else.

It's an obscene sort of self-injury to assume that whatever kind of radical behaviourism is necessary to hype the LLM is the right sort. Hype for LLMs does not lead to a credible theory of minds.

> We would do so regardless of your ability to verbalise anything in particular

I don't mean to say that they literally have to speak the words by using their meat to make the air vibrate. Just that, presuming it has some physical means, it be capable (and willing) to express it in some way.

> It's an obscene sort of self-injury to assume that whatever kind of radical behaviourism is necessary to hype the LLM is the right sort.

I appreciate why you might feel that way. However, I feel it's far worse to pretend we have some undetectable magic within us that allows us to perceive the "realness" of other people's consciousness by other than physical means.

Fundamentally, you seem to be arguing that something with outputs identical to a human is not human (or even human like), and should not be viewed within the same framework. Do you see how dangerous an idea that is? It is only a short hop from "Humans are different than robots, because of subjective magic" to "Humans are different than <insert race you don't like>, because of subjective magic."

> Doing a cover song requires permission, and doing it without that permission can be illegal.

I believe cover song licensing is available mechanically; you don't need permission, you just need to follow the procedures including sending the licensing fees to a rights clearing house. Music has a lot of mechanical licenses and clearing houses, as opposed to other categories of works.

> you don't need permission, you just need to follow the procedures

Those procedures are how you ask for permission. As you say, it usually involves a fee but doesn't have to.

(in the US) Mechanical licenses are compulsory; you don't need permission, you can just follow the forms and pay the fees set by the Copyright Royalty Board (appointed by the Librarian of Congress). You can ask the rightsholder to negotiate a lower fee, but there's no need for consent of the rightsholder if you notify as required (within 30 days of recording and before distribution) and pay the set fees.
Thanks for clarifying. Sometimes I forget that HN has a lot of experts floating around who take things in a very literal and legalistic way. I was speaking in more general terms, and missed that you were being very precise with your language.

Compulsory licenses are interesting aren't they? It just feels wrong. If Metallica doesn't want me to butcher their songs, why should they be forced to allow it?

They are very interesting. IMHO, it's a nice compromise between making sure the artists are paid for their work, and giving them complete control over their work. Licensing for radio-style play is also compulsory, and terrestrial radio used to not even have to pay the recording artists (I think this changed?), but did have to track and pay to ASCAP.

As a consumer, it would be amazing if there were compulsory licenses for film and tv; then we wouldn't have to subscribe to 70 different services to get to the things we want to see. And there would likely be services that spring up to redistribute media where the rightsholders aren't able to or don't care to; it might be pulled from VHS that fans recorded off of TV in the old days, but at least it'd be something.

Any live band performing a song is subject to mechanical licensing as much as a recording artist. Typically the venue pays it, just like how radio stations pay royalties. This system exists because historically, that's how music reproduction worked. You hire some musicians to play the music you want to hear. Copyright applied to the score, the lyrics, and so on. The 'mechanical' rights had to come later, because recording hadn't been invented yet!
"If it breaks and starts spitting out other people's work verbatim (or nearly verbatim) there is a problem."

Why is that? Seems all logic gets thrown out the window when invoking AI around here. References are given. If the user publishes the output without attribution, NOW you have a problem. People are being so rabid and unreasonable here. Totally bat shit.

> If the user publishes the output without attribution, NOW you have a problem.

I didn't mean to imply that the AI can't quote Shakespeare in context, just that it shouldn't try to pass off Shakespeare as its own or plagiarize huge swathes of the source text.

> People are being so rabid and unreasonable here.

People here are more reasonable than average. Wait until mainstream society starts to really feel the impact of all this.

Thank you - a voice of sanity on this important topic.

I understand people who create IP of any sort being upset that software might be able to recreate their IP or stuff adjacent to it without permission. It could be upsetting. But I don't understand how people jump to "Copyright Violation" for the fact of reading. Or even downloading in bulk. The Copyright controls, and has always controlled, creation and distribution of a work. In the nature even of the notice is embedded the concept that the work will be read.

Reading and summarizing have only ever been controlled in western countries via State's secrets type acts, or alternately, non-disclosure agreements between parties. It's just way, way past reality to claim that we have existing laws to cover AI training ingesting information. Not only do we not, such rules would seem insane if you substitute the word human for "AI" in most of these conversations.

"People should not be allowed to read the book I distributed online if I don't want them to."

"People should not be allowed to write Harry Potter fanfic in my writing style."

"People should not be allowed to get formal art training that involves going to museums and painting copies of famous paintings."

We just will not get to a sensible societal place if the dialogue around these issues has such a low bar for understanding the mechanics and the societal tradeoffs we've made so far, and is unable to discuss where we might want to go and what would be best.

Exactly, it is an immense privilege to have your works preserved and promulgated through the ages for instant recall and automated publishing. It's literally what everyone wants. The creators and the consumers. The AI companies are not robbing your money or IP. Period.
If it was as obvious as you claim, the legal issues would already be settled, and your characterization of what LLMs are doing as "reading and summarizing" is hilariously disingenuous and ignores essentially the entire substance of the debate (which is happening not just on internet forums but in real courts, where real legal professionals and scholars are grappling with how to fit AI into our framework of existing copyright law, e.g.^[1]).

Of course, if you start your thought by dismissing anybody who doesn't share your position as not sane, it's easy to see how you could fail to capture any of that.

^[1] https://arstechnica.com/tech-policy/2025/05/judge-on-metas-a...

> But I don't understand how people jump to "Copyright Violation" for the fact of reading.

The article specifically talks about the creation and distribution of a work. Creation and distribution of a work alone is not a copyright violation. However, if you take in input from something you don't own, and genAI outputs something, it could be considered a copyright violation.

Let's make this clear; genAI is not a copyright issue by itself. However, gen AI becomes an issue when you are using as your source stuff you don't have the copyright or license to. So context here is important. If you see people jumping to copyright violation, it's not out of reading alone.

> "People should not be allowed to read the book I distributed online if I don't want them to."

This is already done. It's been done for decades. See any case where content is locked behind an account. Only select people can view the content. The license to use the site limits who or what can use things.

So it's odd you would use "insane" to describe this.

> "People should not be allowed to write Harry Potter fanfic in my writing style."

Yeah, fan fiction is generally not legal. However, there are some cases where fair use covers it. Most cases of fan fiction are allowed because the author allows it. But no, generally, fan fiction is illegal. This is well known in the fan fiction community. Obviously, if you don't distribute it, that's fine. But we aren't talking about non-distribution cases here.

> "People should not be allowed to get formal art training that involves going to museums and painting copies of famous paintings."

Same with fan fiction. If you replicate a copyrighted piece of art, if you distribute it, that's illegal. If you simply do it for practice, that's fine. But no, if you go around replicating a painting and distribute it, that's illegal.

Of course, technically speaking, none of this is what gen AI models are doing.

> We just will not get to a sensible societal place if the dialogue around these issues has such a low bar for understanding the mechanics

I agree. Personifying gen AI is useless. We should stick to the technical aspects of what it's doing, rather than trying to pretend it's doing human things when it's 100% not doing that in any capacity. I mean, that's fine for the layman, but anyone with any ounce of technical skill knows that's not true.

>Yeah, fan fiction is generally not legal. However, there are some cases where fair use covers it.

Which is a clear failure of the copyright system. Millions of people are expanding our cultural artifacts with their own additions, but all of it is illegal, because they haven't waited another 100 years.

People are interested in these pieces of culture, but they're not going to remain interested in them forever. At least not interested enough to make their own contributions.

> Let's make this clear; genAI is not a copyright issue by itself. However, gen AI becomes an issue when you are using as your source stuff you don't have the copyright or license to. So context here is important. If you see people jumping to copyright violation, it's not out of reading alone.

My proposal is that it's a luddish kneejerk reaction to things people don't understand and don't like. They sense and fear change. For instance here you say it's an issue when AI uses something as a source that you don't have Copyright to. Allow me to update your sentence: "Every paper every scientist or academic wrote that references any copyrighted work becomes an issue". What you said just isn't true. The copyright refers to the right to copy a work.

Distribution: Sure. License your content however you want. That said, in the US a license prohibiting you from READING something just wouldn't be possible. You can limit distribution, copying, etc. This is how journalists can write about sneak previews or leaked information or misfiled court documents released when they should be under seal. The leaking <-- the distribution might violate a contract or a license, but the reading thereof is really not a thing that US law or Common law think they have a right to control, except in the case of the state classifying secrets. As well, here we have people saying "my song in 1983 that I put out on the radio, I don't want AI listening to that song." Did your license in 1983 prohibit computers from processing your song? Does that mean digital radio can't send it out? Essentially that ship has sailed, full stop, without new legislation.

On my last points, I think you're missing my point. Fan fiction is legal if you're not trying to profit from it. It is almost impossible to perfectly copy a painting, although some people are pretty good at it. I think it's perfectly legal to paint a super close copy of say Starry Night, and sell it as "Starry night by Jason Lotito." In any event, the discourse right now claims it's wrong for AI to look at and learn from paintings and photographs.

> My proposal is that it's a luddish kneejerk reaction to things people don't understand and don't like.

Your proposal is moving goal posts.

> Allow me to update your sentence: "Every paper every scientist or academic wrote that references any copyrighted work becomes an issue".

No, I never said that. Fair Use exists.

> Fan fiction is legal if you're not trying to profit from it.

No, it's not.[1] You can make arguments that it should be, but, no.

[1] https://jipel.law.nyu.edu/is-fanfiction-legal/

> I think you're missing my point

I think you got called out, and you are now trying to reframe your original comment so it comes across as having accounted for the things you were called out on.

You think you know what you are talking about, but you don't. But, you rely on the fact that you think you do to lose the money you do.

"However, gen AI becomes an issue when you are using as your source stuff you don't have the copyright or license to."

Absolute horse shit. I can start a 1-900 answer line and use any reference I want to answer your question.

> Absolute horse shit.

I agree, what followed was.

> I can start a 1-900 answer line and use any reference I want to answer your question

Yeah, that's not what we are talking about. If you think it was, you should probably do some more research on the topic.