Hacker News
by jterrys 919 days ago
I think using data that you don't have the copyrights to train AI is theft.

That being said, Getty is hardly the paragon of goodwill considering they regularly steal from public domain databases, issue DMCA takedown requests of the stolen content from said databases, and then turn around to sell it to unwitting people for a subscription. They own none of the copyrights for what they are doing but have been allowed to get away with it.

2 comments

> I think using data that you don't have the copyrights to train AI is theft.

There are public domain works you can use and copyright doesn't protect ideas. It protects expression of ideas, so getting "just the ideas" without the expression is ok.

Right. Public domain is stuff that doesn't have exclusive IP rights. You can do with that what you want.

The problem is that "expression of ideas" in the realm of AI is akin to plagiarism by human standards, because it's a literal copying of the source material blended together. I couldn't recite the entire Odyssey to you word for word off the top of my head, but AI can, because it has the source material. We just tell it to do funny ha-ha things so it's okay.

Have you only read books you own the copyright to?

What’s the legal distinction between you learning and AI learning?

If I regurgitate something I read in a copyrighted book without a proper license, that would also be theft; no distinction there.

I'm not distributing my brain. At least the same rules (but probably more restrictive ones) should apply to models: training is okay, but using and distributing should be limited by copyright.

By that logic, publicly explaining anything based on an understanding I got from reading books would be illegal. I'm not sure this is how it works.

They want to muddle the distinction between ideas and expression. You can't copyright ideas. Everyone is entitled to copy ideas.

It would not be illegal, because of fair use (though you have to be careful there too), but if you try to regurgitate large portions of the book then it would be. And we do know that models regurgitate training material verbatim (Copilot).

Redistribution, and the scale of it.

Besides which, "learning" isn't a fair use exemption anyway.