by rpdillon 150 days ago
I see this sentiment posted quite a bit, but have the publishers made any products available that would allow AI training on their works for payment? A naive approach would be to go to an online bookstore and pay $15 for every book, but then you have copyrighted content that is encrypted, that it's a violation of the DMCA to decrypt.

I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

I think they're going to the pirate libraries because the product they want doesn't exist.

6 comments

Perhaps because authors don't want their content to be used for this purpose? Because Microsoft refuses to give me a copy of the source code to Windows to 'inspire' my vibe-coded OS, Windowpanes 12, for which I will not give Microsoft a single cent of revenue, it's acceptable for me to pirate it? Someone doesn't want to sell me their work, so I'm justified in stealing it?
Oh, I'm certain that authors want to take away every right the reader has. This is easy to see empirically: we used to have the first-sale doctrine with physical books, but every digital platform has made reselling either impossible, or a violation of their terms. And yet, courts have rendered judgements that say training an AI model on a book is fair use if the book was obtained legally, meaning no additional license is needed. You assume the authors' permission is needed, but I'm not sure that assumption holds. I think your argument is mostly emotional, rather than legal.

Your use of the word "stealing" is incorrect, but regardless, I'm not condoning piracy, merely examining the incentives we've set up that lead massive, multi-billion dollar corporations to engage in it.

Exactly! ALL of the LLM companies are complicit in this bullying and theft and so are the AI grifters on LinkedIn.
> ALL of the LLM companies are complicit in this bullying and theft

I know this comes across as pedantic, but theft doesn't even come into the equation. Thousands of judges and lawyers have examined this, and there is no argument for theft, at all, in any jurisdiction. Why use such lazy language? Just to inflame the discussion?

> I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

If this is the only legal way for them to train, then yes, that is what they should do instead of breaking the law... just because it's not easy doesn't mean piracy is fine.

My comment is being misread as support for piracy; it isn't meant to discuss piracy at all. It's instead intended to look at everything that's not piracy, examine the costs of those options, and ask why the industry chose the path it did.

Existing rulings are beginning to suggest that if the books can be obtained legally, a separate license is not required for training. So I'm naturally interested in legal ways folks training models would get a lot of books, and whether the publishing industry has even considered the value there.

Hmm, didn't Anthropic buy a bunch of used books (like, physical ones), scan them, and then destroy them? If Anthropic can do that, surely NVIDIA can too.
Yes! And it was ruled legal by the courts, but the media spun it as "Anthropic destroys a million books to build AI". This is the only legal bulk approach I know of, hence my inquiry about such a product. I didn't expect such a harsh response from some of these comments.
Do you believe in private property rights? If the product they want doesn't exist then they're shit out of luck and they must either make one or wait for one to get made. You're arguing that it's okay for them to break the law because doing business legally is really inconvenient.

That would be the end of discussion if we lived in a world governed by the rule of law but we're repeatedly reminded that we don't.

Not arguing it's ok to break the law, but rather examining their incentives and alternatives, along with their associated costs.
The product I want doesn't exist either. But if I pirate, straight to Alcatraz I go.
Yeah, I wasn't discussing legality, simply the incentives and alternatives.
That's not relevant when it comes to copyright law. The copyright holder has the sole legal right to decide how the work is distributed.

If it isn't distributed in a manner to your liking, the only legal thing you can do is not have a copy of it at all.

I was trying to find out if any product that was legal can bridge that gap other than buying books in print, in bulk, and scanning them and destroying them. From the responses here, it sounds like the answer is a vehement "no".

Wasn't asking for advice on copyright, but since we're here, your statement is slightly too strict, at least with respect to US copyright law. The copyright holder has sole distribution authority over the first sale of the work in the United States, but after that the first-sale doctrine allows that copy to be resold or distributed by anyone. It is limited to the US, though, as far as I know. This is what allowed Anthropic to train on printed books, which they then destroyed: they were able to purchase them in bulk because of the first-sale doctrine. The publishers and authors would likely try to destroy the first-sale doctrine if they could, as evidenced by what's happened in the world of digital books.