	by pklausler 1188 days ago
	The general problem of "AI"s being trained on copyrighted content needs to be discussed more thoroughly, I think.

4 comments

bluefirebrand 1188 days ago

Every time I bring this up, people accuse me of resisting progress, "the cat's out of the bag", etc.

It has been frustrating.

km3r 1188 days ago

The cat is out of the bag, and I don't see any reason training should be any more controlled than me personally viewing something and 'training' my brain on it. Using either to duplicate copyrighted works is already clearly illegal.

angrais 1188 days ago

It is illegal for you to download copyrighted material and distribute it as your own. Models trained on such data can (and are statistically more likely to) produce output similar to their (training) input.

So training must consider licensing where copyrighted material is used, and not consume all data.

Your brain is not a model. You cannot reproduce most of what you see. You're not "training" your brain by glancing at an image, as your recall of that image will be terrible.

meh8881 1188 days ago

My brain can certainly recreate something it’s seen before. And it can certainly create something similar to a thing it’s seen before. It’s legal to do the latter and illegal to do the former. Imperfections in an exact recreation don’t affect its legality.

Am I violating copyright law because I am merely capable of producing a copy of something? Obviously not. Why should the model be?

antibasilisk 1188 days ago

>It is illegal for you to download copyrighted material and distribute it as your own

I'm sure the millions of people who violate copyright law daily with absolutely no repercussions care very much about that.

ClumsyPilot 1188 days ago

Millions of people don't pay taxes and cross the road in the wrong place.

You can't set up a cinema and charge for tickets to movies you stole.

It's the money-making side that matters, not individuals in a private house.

antibasilisk 1188 days ago

Ok, so then let's violate copyright and open source the effort!

Paradigma11 1188 days ago

There will just be checks that make sure the generated content is not similar enough to the training material to violate its copyright, and that's it.

GolfPopper 1188 days ago

For the same reason that the police being able to have a person look up in a physical printed file who owns a particular car via its license plate is not the same as having a network of cameras and computers that track every car in the city.

km3r 1188 days ago

Yeah, I don't have any problem with that either. If a cop has a right to see me, he should be legally allowed to record me (and in fact I would prefer all cop interactions were recorded). A camera + AI allows for massive cost savings on basic police work, enabling police to be more efficient. A camera has a lot less bias than a cop.

lupire 1188 days ago

It's because you (and all of us) have a teeny human brain, and these are terrible at remembering things, so the teeny little bits you can remember are protected under Fair Use.

anonzzzies 1188 days ago

I think it’s not very hard; if the AI companies believe the data they trained on is public domain/open because they scraped it off the internet, then their trained weights must be publicly available as well. They cannot claim ‘but training is expensive’; if they do, then they should pay fees for the hosting, storage and writing time of all the data they scraped. I prefer open weights as it’s more practical. Your weights have a sliver of GPL source in them? Well, that infected the entire thing, as GPL does: it is ours now too!

noogle 1188 days ago

The current (legal) answer is "unclear". There are indications that training is fine, but producing and using the generated content is questionable at least. As with many IP issues, it will be resolved only when someone takes it to court and goes all the way to a verdict. Some cases are already in progress, but it might take years to get an answer.

sampo 1188 days ago

> The general problem of "AI"s being trained on copyrighted content

> The current (legal) answer is "unclear".

The European Union was ahead of the times for once. The 2019 copyright directive, article 4, makes it legal to scrape the web and make and keep local copies of copyrighted works for data mining purposes, unless the copyright holders set up a machine-readable exception (such as a robots.txt file).

So legal in the EU, "unclear" in the US.

pklausler 1188 days ago

That does not, to me, automatically imply that an "AI" lawfully regurgitating copyrighted content is a "data mining purpose".

News-Dog 1188 days ago

Consider that an AI may assemble many snippets of copyrighted publications into a chimera of 'Facts'.

'copyright fair use' : https://copyrightalliance.org/faqs/what-is-fair-use/

EamonnMR 1188 days ago

Does OpenAI respect robots.txt? Do we know?

antibasilisk 1188 days ago

Copyright's been dead since the internet was born. I really do think it's the least of our problems when it comes to abstract reasoning engines.
