by WhiteNoiz3 905 days ago
As I understood it, the legal precedent for generative AI is the same one that allows Google to scrape websites in order to index them for search for the common good. Google can also display cached versions of websites, which is the original content of those sites. No one is going to say that Google is copyright infringement just because it is showing content from other websites verbatim. So I think this is a weak argument. AI would be useless if we had to scrub all cultural references and popular IPs (even not so popular ones).

Personally, I think generative AI should be able to provide links to similar source material in the training data. This would be the barest way to compensate those who have contributed to training the AI. I don't think generative AI is sustainable in the long term if it ends up killing all the websites/artists that created the original material. Plus I think having sources adds a layer of transparency and aids users in understanding when content is hallucinated vs. not. People should be able to opt out of having their content used for training and be able to confirm that it has been removed for future iterations. Let's be honest that AI companies are just trying to avoid lawsuits by keeping it secret. These are areas where I think regulation can help rather than worrying about doomsday scenarios.

6 comments

> No one is going to say that Google is copyright infringement just because it is showing content from other websites verbatim

Journalists [1] and Getty Images [2] did in the past

[1]: https://yro.slashdot.org/story/03/07/14/025216/web-caching-g...
[2]: https://www.theguardian.com/technology/2016/apr/27/getty-ima...

And lost, if memory serves.
No, Google agreed to a licensing agreement and removed the direct links to the images.
IMO, this is probably the goal of the NYTimes lawsuits as well
> *I don't think generative AI is sustainable in the long term if it ends up killing all the websites/artists that created the original material.*

This is the elephant in the room. Every tech wave has had its way of cajoling creators into investing time & money to make original material, then the rules changed.

Google promised reach and new markets for content, and it worked. Then they introduced snippets, ads and a whole lot of other things to keep visitors on their freeway, while avoiding sending visitors to the original site.

Reddit, Stack Overflow and others started with gamification (points, badges) & community to incentivize users to contribute original content.

Now AI is shaking up all these approaches. But with each one, the incentive to create original material appears to dwindle, since the returns are becoming less and less.

Like, what's the incentive for any professional now, if AI is going to regurgitate their original content without any upside (i.e. no potential for reach, no gamification, no community, no recognition, etc.)?

> Google promised reach and new markets for content, and it worked. Then they introduced snippets, ads and a whole lot of other things to keep visitors on their freeway, while avoiding sending visitors to the original site.

Afterward came bots that saturated search results with useless SEO barf that pushed content (original and duplicated) so far down that we're coming back to where we started. Content is increasingly unfindable on the web.

I agree with this too. AI is only going to exacerbate the signal-to-noise problem on the web.
> I think generative AI should be able to provide links to similar source material in the training data

Except these aren't databases, so that's generally not possible, in the same way that it's not possible for you to provide links to the source material it took to write your reply. How much learning led to the weights on your neurons that allowed you to generate that? Where did you learn about using italics and its effect on how the words would be interpreted? Where did you learn the tone that would be appropriate in this particular forum?

> People should be able to opt out of having their content used for training

Okay... but then, if I write a book should I be able to opt out of you being allowed to read it? What conditions should I be able to put on who can read my work? Religion? Skin colour? People that aren't good at memorizing?

Hopefully the idea of putting limits on who can acquire knowledge sounds absurd to you. Why are those same limits okay if they're on 'what' rather than 'who'?

> AI companies are just trying to avoid lawsuits by keeping it secret

Which has created a barrier to further research. Instead of me and Joe being able to collaborate on research and papers using the same datasets, we now hide our training data lest the luddites come to smash the machines because learning is only okay if not done too well.

> Except these aren't databases, so that's generally not possible

Not directly and not in every case, but it IS possible to use embeddings to link to similar material. People are doing it pretty commonly using the RAG approach, and Bard is already providing sources, etc. It may not be perfect, but the onus is on the AI companies to figure out how to do it right, not just claim helplessness.
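To make the embeddings idea concrete, here is a minimal sketch of looking up "similar sources" for a piece of generated text. Everything in it is an assumption for illustration: the `CORPUS` URLs and texts are made up, and `embed` is a toy word-count vector standing in for a learned embedding model. It shows nearest-neighbour lookup by cosine similarity, not how any particular AI product attributes its output.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for an index of source documents.
CORPUS = {
    "https://example.com/robots": "robots txt lets site owners exclude pages from search crawlers",
    "https://example.com/fair-use": "fair use is decided with a four factor test in copyright law",
    "https://example.com/rag": "retrieval augmented generation looks up documents using embeddings",
}

def embed(text):
    """Toy embedding: a word-count vector. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similar_sources(output_text, k=2):
    """Return up to k corpus URLs most similar to the text, skipping zero-similarity hits."""
    query = embed(output_text)
    scored = [(cosine(query, embed(doc)), url) for url, doc in CORPUS.items()]
    scored.sort(reverse=True)
    return [url for score, url in scored[:k] if score > 0]
```

Note that this finds material that looks similar to the output, which is not the same as proving which training examples produced it.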

> Okay... but then, if I write a book should I be able to opt out of you being allowed to read it? What conditions should I be able to put on who can read my work?

Sites that don't want to appear in search results, or that have sensitive info they don't want search engines to pick up, can use robots.txt, which has been around since the early days of the web. There are many valid reasons to have mechanisms to prevent something from being included in training data, and I would also argue this is a core feature that is necessary to spur adoption by businesses, as we've already seen. Otherwise, I am not sure I understand your reasoning. People can publish websites and opt to have them excluded from search; the same should apply to AI.
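As a rough illustration of how that opt-out could carry over, here is a sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the `ROBOTS_TXT` contents and the `may_fetch` helper are hypothetical; the assumption is simply that a training crawler identifies itself with its own user agent and honours the file the same way a search crawler does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI training crawler everywhere,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `may_fetch("ExampleAIBot", "https://example.com/article")` is refused while an ordinary search crawler is still allowed. robots.txt is only a request, so this works only for crawlers that choose to respect it.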

Well said. Extending copyright to control content consumption and learning is a recipe for converting all of our mass media into businesses as abusive and usurious as textbook companies.

This is a power grab by publishers.

No legal precedent has been set as of yet. The "precedent" you describe is the argument AI companies have been using (that training their models on information available on the Internet should be considered "fair use") but whether AI training actually satisfies the four-factor test for fair use remains to be seen.
It's a null question. Training itself is neither publication nor distribution, so copyright can't be relevant at that point. "Fair use" just isn't a concept applicable to training.
Training stores a variation of the source material, which is arguably distribution. And selling the result or selling access to it certainly is. So fair use applies, and the hope is that a court finds the process transformative enough to count as fair use. Given original material can be spat out, my money is on a court thinking this is about as transformative as a compression algorithm.
Selling the result is where it's on dodgy ground. I disagree about storage though.
Exactly. Framing reading as fair use is a huge and dangerous expansion of copyright.
Storing copyrighted content itself can sometimes be illegal, like ripping a Blu-ray. What if these frames are now stored on their servers and go into the training dataset?
The illegal bit of ripping a Blu-ray is circumventing the copy protection, not the storage. At least, that's how I've always understood the effect of the DMCA on the situation.
The ability to provide a reference to the source is the crucial difference here.

I agree that it should be possible to implement that for generative AI, although the training may become significantly more expensive in order to maintain that information, and the AI companies have little interest in doing so. They’ll probably rather try to heuristically assess possible copyright issues after the fact in a post-processing step.

The more interesting question is if copyright holders can claim unauthorized use of their works beyond the case of near-verbatim reproduction, because the works collectively inform the AI in a more general manner.

> They’ll

What if I asked you to list all your source material that led you to use that particular contraction? Heuristics will not do; you must list each.

Can you do it? Do you believe AI should?

> I agree that it should be possible to implement

Those exact words appear in another forum post from 2006:

https://discourse.igniterealtime.org/t/cm-3beta-compression-...

Should you have quoted that as a source for your reply? What if we knew you'd read that post back in 2006, affecting your neurons, then should you?

It might not be too hard to imagine a simple case of a specific topic where you might have some more prominent sources, but even in those cases I believe if you think it through you'll find there were a ton of other sources that led to the weights that allowed you to 'know' the topic.

I believe they should be able to, to the degree that their output can constitute copyright infringement. Obviously, the fewer sources from the training data a given output matches, and the longer the match, the more relevant it is, and the easier it should be. I believe it should be feasible exactly because of that correlation. The examples you present are largely irrelevant to the problem, because they are largely irrelevant to the citing of sources for copyright reasons.
>> Those exact words appear in another forum post from 2006. Should you have quoted that as a source for your reply? What if we knew you'd read that post back in 2006, affecting your neurons, then should you?

> I believe they should be able to, to the degree that their output can constitute copyright infringement.

But not you? The inference behind the AI-violates-copyright movement is that machine obligations should be brought to a parity with our obligations - that AI and you be fully subject to the same copyright overlordship.

I would independently agree that having AI divulge sources could be a good thing.

I do not agree with this attempt to twist copyright into yet another misshapen hammer, so copyright holders can bludgeon out some result they want.

Wonder. Do Cliff Notes have to pay royalties to the underlying material?

Cliff Notes contain quotes, and citations.

Does the Cliff Notes company, when producing Cliff Notes for "Into The Wild", pay royalties to the publisher?

For that matter, does any paper, article, etc. that contains a quote from another work have to pay royalties to the source of the quote?

Cliff’s Notes has a strong fair use claim, because they offer basic criticism and surface-level commentary alongside their summaries.
They also, arguably, add value to the books themselves.