by johnnyanmac 454 days ago
From my limited understanding, we have a sort of agreement in place with a file called robots.txt that's more or less a handshake with such scrapers. Of course, the issue is that these scrapers are blatantly ignoring robots.txt.
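
For anyone unfamiliar with how that handshake works in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`ExampleScraper`, `OtherBot`), the example.com URLs, and the helper functions are all made up for illustration. The takeaway is that the check runs entirely on the crawler's side, so nothing forces a scraper to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so no network request is made.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL.

    A polite crawler calls this before every request; compliance is voluntary.
    """
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for agent, url in [
        ("ExampleScraper", "https://example.com/index.html"),
        ("OtherBot", "https://example.com/private/page"),
        ("OtherBot", "https://example.com/index.html"),
    ]:
        print(agent, url, may_fetch(parser, agent, url))
```

A bot that never calls something like `may_fetch` is not blocked by anything in the file itself, which is why enforcement has to come from somewhere else.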

A license can help as well, but what's a license without enforcement? These companies are simply treating the courts as a cost of doing business.


Close, robots.txt was originally for web crawlers, to reduce accidental denial-of-service attacks. It had nothing to do with scraping (i.e. downloading content and parsing the HTML tags in a programmatic manner).
What do you think a search engine’s crawler bot is doing exactly? I could sure be wrong, but I have a hunch that “downloading content and parsing the HTML tags in a programmatic manner” describes it.
Yes, but the difference is that the term "scraping" also targets things like automatically generating RSS feeds from HTML pages, which is not covered by robots.txt.
I thought robots.txt covered all automated, programmatic access by third parties where a bot slurps stuff and follows links, without splitting hairs about it.

But what do I know, the young whippersnappers will just word lawyer me to death, so I better shut up and go away.