Hacker News
by Supply5411 1002 days ago
DMCA 2024 - A nice big report button for when generated content is too close to copyrighted content. It is then on the AI company to supplement the training materials around that content, to dilute the generation of content that could be seen as infringing. So instead of George RR Martin prequels with the same names and characters (because of a lack of training materials), it generates something more generic for the input prompt.

Win/win?

3 comments

The actual complaint is about using their copyrighted works in the training of the LLM without a license. OpenAI is claiming it's fair use; the authors disagree. It's going to take a ruling from a judge to get clarity on the issue, and no matter what, it'll be appealed until it hits the Supreme Court.
Do we know if OpenAI bought the book, or did they just "accidentally" pirate the book?
That's what discovery will be for. The complaint alleges that the likely source was LibGen. Most of these authors haven't released DRM-free ebooks, and it seems unlikely that OpenAI has a large-scale book scanning effort (and even if they did, the authors would likely claim that to be infringement itself).
What if it never accessed the book, but read everything relevant like episode summaries, fan wikis, and forum discussions? It would still be as conversant. Is it still infringement?
Oh right, was it really proven that they were training on BitTorrent book collections?
Or, just let people and computers be inspired

Ideas have never been the scope of copyright and it wasn’t in its democratic mandate. If creatives want that changed, fine, advocate for a change in the law.

>Ideas have never been the scope of copyright

This isn't about ideas, it's about a specific individual's work, given that the reproduced text lifts literal characters out of Martin's book. That has always been covered by IP law. Canonical example: you cannot write a novel about Harry Potter, but you can write a book about a wizard going to a magical school.

If a model generates large amounts of text that is very close to something you've written, because there isn't much else like it, how is that "inspired"? It needs more dilution.
We would have to change the law to allow the kind of ‘inspiration’ you are talking about, which is why there are multiple lawsuits here. That’s what OpenAI is asking for: a redefinition of ‘fair use’. NNs aren’t copying ideas; they train on what copyright calls ‘fixation’. They deal with text, audio, and pixels, not ideas. We keep hoping and looking for understanding in the NNs, but we have ample evidence that they don’t actually understand much, if anything; they are just really good at copying in a way that makes understanding seem plausible to the layperson.
It’s a good idea to make this easier to report, but… shouldn’t it be on the AI company to train using legally acquired content in the first place? It’d be great if the training data were opt-in and curated. Wouldn’t that be better than a shoot-first, ask-questions-later policy? There’s definitely room to improve copyright and room to allow AI to exist, but do we really want to allow AI to ingest all copyrighted material and call it ‘fair use’? That would be giving them a ridiculous and unprecedented amount of freedom to take any and all content and turn around and auto-generate enough to obsolete the people who made the training material. It seems like the race is on to supplant Google as the portal for information, and it does feel like downloading everything in the world and then crying fair use after the fact is wishful thinking that more or less admits to copyright violation.
>shouldn’t it be on the AI company to train using legally acquired content in the first place

I don't think so. It's not illegal to look at or learn from copyrighted materials. If you start producing the materials it becomes a different question. I think the same applies to AI.

Your argument doesn’t work because OpenAI has admitted that ChatGPT is producing copyrighted material. They’re trying to carve out an exception for AI, but have already acknowledged that training does copy the materials, literally, and that it does not “learn” from them the same way humans do. The intent with AI may be to remix them, but the whole reason there are multiple lawsuits here (as well as with Stable Diffusion and other NNs) is that they have repeatedly demonstrated they sometimes memorize the training data and can produce it more or less verbatim. They have violated current copyright law. In that light, we have two primary options: change the law, or enforce the current law. OpenAI is hoping to change the law, but whether they have copied some training data and reproduced it in the output is not even up for debate; this is already the different question you referred to.