| HN Mirror



	by sandworm101 567 days ago
	But all content is DMCA protected. Avoiding copyrighted content means not having content, as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield. The apparent loophole is between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.

2 comments

Xelynega 567 days ago

Yes, that's how every other industry that redistributes content works.

You have to license content you want to use; you can't just use it for free because it's on the internet.

Netflix doesn't just start hosting shows and hope they don't get a copyright suit...


YetAnotherNick 564 days ago

In almost all cases before gen AI, scraping was found to be legal unless the bot accepted terms of service, in which case the bot is bound by the ToS. The biggest and clearest example is [1]. People have been scraping the internet for as long as the internet has existed.

[1]: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


account42 564 days ago

Before gen AI, scraping mostly wasn't about copyrightable data but about finding facts. Scraping doesn't magically make copyright infringement legal.


noitpmeder 566 days ago

It's insane to me that people don't agree that you need a license to train your proprietary for-profit model on someone else's work.
