by flessner 416 days ago
If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.

What is inspiration? What is imitation? What is plagiarism? The lines aren't clearly drawn for humans... much less for LLMs.

9 comments

> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.

I can absolutely guarantee you that neither DeepSeek nor Alibaba's highly talented Qwen group will care even a little bit, in the long run. Not if there's value to be had in AI. (And I can tell you down to the dollar what LLMs can save in certain business use cases.)

If the US decides to unilaterally shut down LLMs, that just means that the rest of the world will route around us. Whether this is good or bad is another question.

The pattern hasn't changed in decades. Remember when Huawei copied Cisco's router code so precisely they included the same bugs and documentation typos?

LLMs are a drop in the bucket compared to the countless other reasons why the world is already routing around the US, but I don't want to get into politics or economics.

> If the US decides to unilaterally shut down LLMs, that just means that the rest of the world will route around us.

You're talking as if they are some kind of nationalized or publicly owned asset, as opposed to a bunch of for-profit, privately owned silos.

Local models are a thing, though. You can run DeepSeek on your local computer.

Even if ChatGPT, Huggingface, etc. died, we would still have the models and we would still be able to run them.

Local models mitigate many of the ethical concerns, but that's not what the end game is.

The firms pouring trillions of dollars into them want to own their creative output, and charge rent for access to it.

These cases can set precedents that basically shut down all future "useful" AI implementations if the judges go too far. You can bet that the CCP doesn't care one whit about US copyright law if it leads to them leapfrogging us; it will definitely be j-johna-jameson-laugh.gif. I think that's the commenter's point, seen from 20,000 ft.

> if it leads to them leapfrogging us

They won't be leapfrogging 'us'.

They'll be leapfrogging some privately owned, for-profit business.

Frankly, I don't give two shits about that, and I fail to see why anyone who doesn't own one of them should give one.

If they want me to have skin in the game, they should share either the profits, or the models. Until they do, it's not 'us', it's 'them'.

I care about the USA remaining relevant. If you don't, that's fine as well. Have a great life.

China found the perfect way to disrupt US tech, releasing open source versions of it for free or at least cheaper. Most of US tech is built on open source anyway, and at the pace YC is investing in open source alternatives, open source will win out in most niches.

My fear is that US tech won’t be able to compete with state-sponsored open source out of China and will move to ban open source or suppress it somehow.

Also, the Chinese work is legit. DeepSeek introduced a whole bag of new techniques like GRPO, and released quite a bit of good open source tooling.

And Alibaba's Qwen team seems to be quite genuinely talented at "small" models, 32B parameters and below. Once you get Qwen3 properly configured, it punches well above its "weight class." I'm still running real benchmarks, but subjectively, it feels like the 32B model performs somewhere between 4o-mini and 4o on "objectively measurable" tasks. It's a little "stodgy" and formal by default, though. We'll see what it looks like when people start fine-tuning it.

If the US dropped off the planet, it would maybe set LLM technology back a year.

DeepSeek really changed how people think about Chinese tech. Even after new LLMs launched, DeepSeek R1 and V3 hold their own on benchmarks and are significantly cheaper.

> China found the perfect way to disrupt US tech, releasing open source versions of it for free or at least cheaper.

Meta did this first I believe?

> And I can tell you down to the dollar what LLMs can save in certain business use cases.

Please do!!

$3.50.

But seriously, LLMs are useless for anything except the most basic secretarial tasks and even there they still require as much reviewing as a high school intern. The reason executives want to use AI instead of humans is because AI is a capital expense that does not hit the financials the same way that wages do. (Capital expenses are excluded from EBITDA, which is the most common metric for measuring a company's financial performance. But it's become so popular for companies to push expenses to ITDA whenever they can that financial analysts are starting to push back and include ITDA in their analyses.)

In a nutshell, there are 3 ways to look at company financials: the PR way (EBITDA), the financial reporting way (GAAP, which includes ITDA), and the tax way (starts from GAAP or IFRS but with numerous rules on what items are included or excluded).
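
To make the EBITDA point concrete, here is a rough sketch with made-up numbers. All figures, names, and the three-year straight-line depreciation schedule are assumptions for illustration only, not taken from any real company. The same 300k-per-year cost reduces EBITDA when it is paid as wages, but not when it is capitalized and depreciated:

```python
# Hypothetical figures, chosen only to illustrate the accounting effect.
REVENUE = 1_000_000
OTHER_OPERATING_COSTS = 400_000


def ebitda(revenue, operating_costs):
    """Earnings before interest, taxes, depreciation, and amortization."""
    return revenue - operating_costs


def ebit(revenue, operating_costs, depreciation):
    """Operating income: EBITDA minus depreciation and amortization."""
    return ebitda(revenue, operating_costs) - depreciation


# Scenario A: the work is done by staff, so 300k of wages is an operating cost.
wages = 300_000
ebitda_wages = ebitda(REVENUE, OTHER_OPERATING_COSTS + wages)
ebit_wages = ebit(REVENUE, OTHER_OPERATING_COSTS + wages, depreciation=0)

# Scenario B: a 900k system is capitalized and depreciated straight-line
# over 3 years (assumed), giving 300k of depreciation per year.
capex = 900_000
useful_life_years = 3
annual_depreciation = capex // useful_life_years
ebitda_capex = ebitda(REVENUE, OTHER_OPERATING_COSTS)
ebit_capex = ebit(REVENUE, OTHER_OPERATING_COSTS, annual_depreciation)

print(ebitda_wages, ebit_wages)   # 300000 300000
print(ebitda_capex, ebit_capex)   # 600000 300000
```

EBITDA looks twice as good in scenario B even though operating income is identical in both, which is the gap analysts close when they put depreciation back into the picture.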

What's the point of being proud of one system of government if you're willing to relinquish it in the face of an adversary?

Shouldn't they have to follow the law?

You point to Chinese companies disregarding any rules if there is value to be had in AI, while in the US, AI companies are going to get 500 billion in investment and a whistleblower is dead.

US AI companies will either make sure that a similar ruling will never be made or they will ignore it and pay the fines. They won't let anybody stop the gravy train.

Or AI companies could use some of their vast reserves of cash to pay for licensing agreements and pay people for their fucking intellectual property, then feed it to the beast.

But then they'd have to actually communicate with people and negotiate consent instead of just hoovering up everything they can get their hands on in their quest to replace it.

> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.

You assume that getting tested means the AI trainers lose, and also that the model architectures that have been developed can’t be retrained from scratch with public domain, owned, and purpose-licensed material. (With several AI companies having been actively pursuing deals to license content for AI training for a while now.)

> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.

End of the road for major AI companies, and hopefully something better can be created once it's declared illegal without any murky waters.

There are LLMs trained on data that isn't illegally obtained; OLMo by Ai2 is one such model that is actually open source and uses open data for training. The fact that it's "very difficult" for OpenAI et al. shouldn't be an argument against forcing them to behave ethically anyway. If they cannot survive acting legally, then so be it, sucks for them.

That would hardly be the end of the road. If copyright enforcement gets stricter then that will give a market advantage to the largest, best funded major AI companies like OpenAI because they can afford to simply buy licenses from copyright holders. I predict that we'll see new middlemen arise specifically to handle this licensing, much like the agencies that handle most music licensing today.

It's not the end. All these companies have "clean" datasets which they train their models on now, along with training on the previous "dirty" models. But it's been so many generations that they don't need to worry about this copyright issue anymore.

The lines for humans aren't clearly drawn, but they are drawn. The main difference is that humans are humans and LLMs are computer programs.

I see no reason why we should even entertain the idea of extending human rights to computer programs, and so far, nobody has been able to give me any good reasons why.

Furthermore, why are we only entertaining the human rights that can be used for profit-driven purposes? Why do LLMs, for example, not have the right to free speech? Or an attorney? It seems highly unethical to grant these computer programs some protections as if they're humans but not grant them personhood. This is akin to slavery, which is something we actually have to consider. Anthropomorphization is a double-edged sword. We cannot simultaneously consider them human when convenient and then consider them programs when it's not. Or, if we want to do that, we need to form a coherent argument as to why, how, and when.

You're thinking about it using the wrong framework IMO.

It's not about the program's rights, it's about the human's rights to use the program. Not the machine's right to do something, but the human's right to do something through a machine, or make a machine do something.

No, because the entire argument hinges on the fact that LLMs learn, which is like humans learning, so it's transformative. That only works if you consider learning or transformation to be something that does not rely on the human spirit. Which, actually, most people do not believe. And it's pretty difficult to argue - we don't even know how learning works for people.

A lot of people just jump to LLMs learning like it's a foregone conclusion. Mm... no. You need to convince people of that. You'll find if you talk to non-tech people, they're not just going to believe you when you say that.

Why isn't an LLM more akin to a database or a compression algorithm? Why is it closer to human learning? After all, humans are humans and we have the exclusive right and power to determine what is human and what isn't. And database and compression algorithms are computer programs, of the same kind as an LLM.

Then the results would be the same, and it would still be fair use. I have yet to see an example that demonstrates LLMs plagiarize by default or by tendency.

Your causality seems to be inverted here. You seem to be implying that "learning" (or the ingestion and retention of information for the same means) is banned by default for everything, but we decide to allow it for humans as the sole exception. This is not the case. Everything not prohibited is allowed, and "intermediate copies" are considered to be vital to fair use by the court system.

No, it wouldn't. Because if I record "Revenge of the Sith", compress it, and then distribute it for free online, that's obviously not fair use.

Fair use is pretty complicated. Part of fair use is "The Effect of the Use on the Potential Market for or Value of the Work", which already puts even human commercial endeavors in a tough spot. You can make it work, but you have to really try. Satire like Weird Al or whatever isn't competing with the music it's satirizing; the Venn diagram of those markets barely overlaps. But a lot of LLM use cases are explicitly meant to obsolesce and siphon value from the things they used.

Like, why go to Getty Images when you could instead go to the glorified database, which has ingested all of Getty Images, and acquire an indistinguishable stock photo for free?

The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

> The only reason we're even really entertaining this is because people continually draw parallels to humans. You see, it's not stealing from Getty. It's more like if someone saw Getty Images and then went out and took a photo in that same flat, boring style. Except nobody saw anything. And nobody went out and took a photo.

But unless your argument is that the photo outputs from the GenAI are literally equivalent to the training data, you would agree the end result is the same, right? Anyone can see that the images are not the training data stitched together, so it doesn't even really matter how it all works mechanistically, even though your description ("glorified database") is wrong.

> That only works if you consider learning or transformation to be something that does not rely on the human spirit.

Even changes made using simple non-ML algorithms can be transformative according to fair use doctrine, like the thumbnailing of images done by search engines. It's not meant in some spiritual sense.

The reason that’s okay is because you aren’t competing against the initial source material. A thumbnail on Google for “Revenge of the Sith” is not a replacement for watching the movie.

But, a lot of AI products are specifically and explicitly designed to obsolesce the thing they trained off. No need to go to Encyclopedia X or the NYT, this has the same content.

> The reason that’s okay is because you aren’t competing against the initial source material.

I don't mean to claim that search engine image thumbnailing is like-to-like in every consideration, just that it demonstrates there's no "human spirit" required in order to qualify as "transformative" as far as fair use is concerned. Search engine image thumbnailing has been found to be transformative, for instance in Perfect 10, Inc. v. Amazon.com, Inc.: "Google's use of thumbnails is highly transformative."

And, though I'm probably being pedantic here, I think it's important to distinguish that the other fair use factor you allude to is not whether you're "competing against" the original work, but specifically the effect of your use on the market/value of that original work. For example, if your documentary uses a clip from a TV show and also happens to air in the same time slot as that TV show, the extent to which you compete with or displace the market for the TV show in general (as you would even if you had not included the clip) is not what's under consideration, but rather only the additional extent to which you displace its market specifically due to the inclusion of that clip.

Because of that, I'd claim that some machine-learning-based tool that partially displaces the market for a work it was trained on (for instance, Google Translate displacing the market for a translated version of a book) might still be seen reasonably favorably under the market impact factor, so long as the extent it displaces that work is largely independent of whether it has trained on that work specifically (such as if the translation tool could already provide a decent translation of the original book even before having trained on its translated version).

Or maybe they just need a license for their particular use case...

The whole point of the fair use clauses is to protect humans. Clearly we can easily say that programs are altogether exempt in favor of humans, and it would be a proper thing to do, until the first real AI is built.

If corporations owned human slaves and fed them copyrighted materials so that they were inspired to produce original creative output, I don't think that creative output should enjoy legal protections either. Even if slavery were not illegal.

Because the obvious question would be - how can free people compete with that?

The FairTrained models claim to train with only public domain and legal works. Companies are also licensing works. This company has a lawful foundation model:

https://273ventures.com/kl3m-the-first-legal-large-language-...

So, it's really the majority of companies breaking the law who will be affected. Companies using permissible and licensed works will be fine. The other companies would finally have to buy large collections of content, too. Their billions will have to go to something other than GPUs.

I don't know?

Not really sure a claim is good enough. I don't know that you can just go into court and say, "Trust me, I don't use copyrighted material."

And I also can't see any way, other than providing training data and training an identically structured model on that data, that a company can conclusively show that they got the weights in an allegedly copyright free model from the copyright free training data a company provides.

I do hope people are still innocent until proven guilty?

If you did not use copyrighted materials for training, people will not be able to prove that you did, and that should be good enough.

> I do hope people are still innocent until proven guilty?

It's a civil matter, not a criminal matter, so that doesn't apply.

While the others are correct, I'm with you in the sense that I don't know if what they claim is true. I've also found others, like one in Singapore, that didn't train on data that was as legal as news reports claimed. It might turn out to have problems.

There is benefit to using them, though. For one, they've tried really hard to be legal. That sets a positive example, shows good faith if they were sued, and reduces risk for those using them (good faith on our part). Also, one can be sure that they can ditch or replace any outputs in the long term if they're ruled illegal. So, we try not to use the A.I.'s in a way where losing access to them seriously damages our business.

That's the best I can offer until legal reforms happen.

If training, one can train it in Singapore on material one has legal access to. Their law pretty much lets you use anything for AI purposes so long as you can legally access it yourself. To further reduce the risk, they should crawl it themselves, too, taking care to avoid risky sources.

Civil courts work by you proving damages (at least in the USA), not by you going on fishing expeditions because they "might" have done something.

So good luck finding the thing that looks exactly like your copyrighted work that's not in the corpus. If you can, then yeah, you might be able to prove it.

At the end of the day it's like a lot of business, where a liability shell game is played out, and if the chain of evidence can't be drawn quite brightly then lawsuits would be frivolous at best.