| HN Mirror



	by squigz 798 days ago
	I don't think that's how robots.txt or scraping really works. Do you expect scrapers to announce every bot they run? Do you expect webmasters to add a robots rule for every bot? If someone doesn't want OpenAI or anyone else scraping their site, it doesn't matter whether the scraper announces itself, as long as it respects robots.txt and you have rules that catch unannounced scrapers.
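	To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names (`ExampleAIBot`, `UnannouncedBot`), the `example.com` URLs, and the rules themselves are hypothetical, chosen only for illustration. It shows a named rule for one announced bot plus a wildcard rule that also applies to bots that never announced themselves. This only constrains scrapers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for a named (announced) bot,
# and a wildcard rule that applies to every other bot.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot with this user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The announced bot is blocked from the whole site.
announced_blocked = not is_allowed(parser, "ExampleAIBot", "https://example.com/post/1")

# A bot with no dedicated rule falls through to the wildcard entry.
unannounced_public = is_allowed(parser, "UnannouncedBot", "https://example.com/post/1")
unannounced_private = is_allowed(parser, "UnannouncedBot", "https://example.com/private/x")
```

	The wildcard entry is what handles bots you have never heard of, so a per-bot rule is only needed when you want to treat one bot differently from the rest.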


cuddlyogre 798 days ago

What I'm saying is that it doesn't matter if you disallow them access now, because they've already gotten everything they want, whether you wanted it or not.

The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

When the little guy does it, that's called piracy and theft. When billion-dollar corporate entities do it, it's called a technological marvel.

squigz 798 days ago

> The difference between this scraper and other scrapers is that scrapers are normally used for personal or nefarious purposes.

This doesn't seem accurate at all. Plenty of businesses are built on scraping data; see: Google.

> The data scraped for AI models is used explicitly for a commercial purpose by a commercial entity and the original creator received zero compensation or notice that their work was going to be used in a commercial product. The actual rights holders of the works that were used in an unauthorized manner have no way to seek compensation or removal of their work from this commercial product.

I think the questions of fair use might keep us busy for hours.

> There is little material difference between this behavior and if someone downloaded your site and used its content in a book they were selling. It doesn't matter that you discovered this book was printed two years ago. Your work is still being used without your permission.

I think a fairer comparison would be if someone used my website as reference, inspiration, etc. when writing a book.