by sandworm101 567 days ago
But all content is DMCA protected. Avoiding copyrighted content means not having content as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.

The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

2 comments

Yes, that's how every other industry that redistributes content works.

You have to license content you want to use; you can't just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...

In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot is bound by the ToS. The biggest and clearest example is [1]. People have been scraping the internet for as long as the internet has existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn

Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.
It's insane to me that people don't agree that you need a license to train your proprietary for-profit model on someone else's work.