Hacker News
by Permit 35 days ago
> where infringing copyright is legal as long as you're rich.

This isn’t true. A rich person and a poor person can train LLMs on copyrighted material in 2026. How they acquired those materials matters. Wealthy corporations hold no legal advantage in this space. For example, Anthropic recently settled for $1.5 billion due to acquiring books via piracy: https://www.nytimes.com/2025/09/05/technology/anthropic-sett...

My understanding is that an individual could likely pirate the same books without paying a dime (not due to differing legal standards but simply because it would be hard to identify them in many jurisdictions). In a practical sense, it seems corporations are held to a higher standard in this regard.

The discrepancy is that some people equate training a model with piracy even though they are not the same thing. This is typically due to intellectual laziness (refusal to understand the differences) or willful misrepresentation (due to being ideologically opposed to generative AI). No need to make such a mistake here though.

2 comments

Of course it's not the same thing -- it's way worse.

The piracy comes first, and it's exactly the same thing. GenAI Corp. can't train models on illicitly obtained media before illicitly obtaining said media. And that very thing is already what private individuals got and get sued for millions over.

The GenAI Corp., having gotten away with that unpunished, then goes on to commit further violations by commercially exploiting the media with neither a license to do so, nor any intentions to pay the rights-holders for their use.

By the media conglomerates' own math, these GenAI companies should all be drowning in lawsuits over kazillions of bajillions of dollars.

> The piracy comes first, and it's exactly the same thing. GenAI Corp. can't train models on illicitly obtained media before illicitly obtaining said media.

My contention is that this is not happening. Most generative AI companies do not source their training data from illegal torrents, and the few that do are currently paying for it. Further, I suspect the companies that get away with it today are _smaller_, not larger.

Training data is typically sourced by scraping the publicly available web.

> Of course it's not the same thing -- it's way worse.

Setting aside your own moral standards here, we should at least be able to agree that from a legal standpoint training a model is not copyright infringement.

> A rich person and a poor person can train LLMs on copyrighted material in 2026.

Updating an old adage for the modern age:

“The law, in its majestic equality, forbids rich and poor alike to sleep under bridges, to beg in the streets, and to steal their bread.” ― Anatole France