Hacker News
Is training an AI on public sources breaking copyright? (twitter.com)
4 points by sprynr 1815 days ago
1 comment

The way I look at this is the same as if an art student looks at 50 surrealist art pieces and then attempts to paint one themselves. That doesn't break copyright, and yet they've "trained" themselves by looking at those pieces of art. So, if that is true for an art student, why not an AI?
I agree with you. I think the counterargument could be that the training data is incorporated into the weights, and therefore some version of it is being copied. What gets "copied" into the human mind is exempt from copyright because of the long-standing precedent that it can't be controlled. I think, like you, that training ML models should get the same "exemption" because it's primarily an experiential thing, not making a knockoff, e.g. by copying a video. I think people like the Twitter poster see ML benefiting from freely available content in new ways, and are upset that someone is getting some benefit without paying for it. I think the last thing we need is new ways of rent-seeking.
In general I agree with you. But it gets trickier once you look into more automated content generation. It's already been shown that in natural language generation, as an example, big models frequently regurgitate full text passages. The same happens in computer code generation.