| HN Mirror



	by johnnyanmac 454 days ago
	From my limited understanding, we have a sort of agreement in place called robots.txt that's more or less a handshake with such scrapers. Of course, the issue is that these scrapers are blatantly ignoring robots.txt. A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost of doing business.


CaptainFever 454 days ago

Close, but robots.txt was originally meant for web crawlers, to reduce accidental denial-of-service attacks. It had nothing to do with scraping (i.e. downloading content and parsing the HTML tags programmatically).
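
For readers unfamiliar with the mechanics, here is a minimal sketch of how a well-behaved bot might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration. Nothing here is enforced by the server: the check only works if the bot chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical user agent; it falls under the "*" rules.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/articles/1")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/data")

# Crawl-delay is a non-standard but widely seen directive (seconds between requests).
delay = parser.crawl_delay("ExampleBot")
```

A polite crawler would skip any URL for which `can_fetch` returns false and wait `delay` seconds between requests. A bot that never performs this check faces no technical barrier at all.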


WesolyKubeczek 454 days ago

What do you think a search engine’s crawler bot is doing exactly? I could sure be wrong, but I have a hunch that “downloading content and parsing the HTML tags in a programmatic manner” describes it.


CaptainFever 454 days ago

Yes, but the difference is that the term "scraping" also targets things like automatically generating RSS feeds from HTML pages, which is not covered by robots.txt.


WesolyKubeczek 454 days ago

I thought robots.txt covered all automated, programmatic access by third parties where a bot slurps stuff and follows links, without splitting hairs about it.

But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.
