Hacker News
by _zephyr 892 days ago
I do think OpenAI has a point in what they're saying: if we expect human-level competency of AI, it needs to be able to see and train on human-accessible content and ideally with a similar distribution.

For example, I make an open source Firefox web extension for filtering internet content with my own classifier. It literally could not exist without being trained on web content, much of which is copyrighted. Requiring that I either a) use only attributed data or b) detect and exclude copyrighted content when trying to build something representative of my source distribution (e.g. the web) sounds like a recipe for a poor outcome. Now maybe my addon isn't your cup of tea - but what if you found out that the next generation of uBlock Origin etc. could not be as effective because legislation kept it from using an AI model? Legislating too heavily around this area will, I believe, have a tremendous chilling effect for small businesses and open source folks trying to innovate in AI.

I've also worked commercially in the creation of two closed source machine learning models, but the domains were restricted enough that web content was not a particularly helpful input. One did all right, and one did not. Seeing bets succeed and fail gives me appreciation for the long-term and uncertain bets that OpenAI has been making for ages finally coming to fruition. I think without businesses being willing to make those bets the GPU-hours would have been hard to pay for.

I've wondered whether a different way out of this is to not restrict the use of copyrighted material in the training process itself, but instead to consider only the final created works. Of course there are thorny problems there, too, but I don't see that having the same chilling effect on research, and it would probably have a lesser effect on business as well. One thing I think is clear though: we've reached a tipping point in the US, similar to 1998 when the DMCA was enacted, where the technology is forcing us to think carefully about what copyright means.

So I have a question for those on HN who have meaningfully worked on the creation of not just AI-generated content, but of some AI model that others use freely or commercially: what seem like promising paths forward here? Or to those working in copyright law (like @williamcotton): how do you see the status quo and potential paths forward?

1 comments

> I do think OpenAI has a point in what they're saying: if we expect human-level competency of AI, it needs to be able to see and train on human-accessible content and ideally with a similar distribution.

Any human who wants to access copyrighted works is required by law to honour the copyright - whether that means purchasing licenses, timed rental access, or an immediate cease and desist of usage. Why should AI (and the billion-dollar-backed companies building them) get different treatment?

Or to turn it the other way: if the billion-dollar-backed companies building AI models do get a free pass then surely humans should too?

I think there's an important distinction here. We access copyrighted material all the time though, as accessing copyrighted material is not always protected in the ways you describe. We view copyrighted images via Google Images, for example. That works because Google stores metadata that points at the content and then loads it. Copyright is (broadly) more about the "not copying it" part.

> We access copyrighted material all the time though, as accessing copyrighted material is not always protected in the ways you describe.

Sure, because there's an incentive to attract humans to view such content. It's advertising. Google Images isn't built out of goodwill; it is now a target to optimize for, to maximize human traffic and get human eyeballs on ads (or paywalls) so that humans consume more products. Having a bot come in ruins that, and the literal billions thrown at adtech to try to verify organic traffic show that bot traffic is not what those who pay for ad space want.

That perspective from a business lens shows the difference between a human viewing copyrighted material and a bot doing so. Humans are monetization targets; bots are not. Humans can advertise for you to other humans; bots can't (well... not yet. But do we really want to be talking about ChatGPT ads this early in the LLM era?)