by rizzom5000
4940 days ago
Good points, but...
1. I do think it addresses search engines, because site operators explicitly give search engines permission to scrape their sites via "robots.txt" files, otherwise known as the "robots exclusion standard".
2. Like all other scenarios, this one is also likely between you and the site operator. Are you breaking the site's TOU? The answer to that question might help. If you are asking me for the answer to a moral dilemma, I might suggest that you try Shakespeare for some relevant insight to your question(s).
3. See (2).
I believe you are incorrect in your last sentence, and in a number of ways, but feel free to disagree.
|
Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated [3].
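To make that distinction concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, and the `may_fetch` helper with its `user_initiated` flag are all made up for illustration; skipping the check for user-initiated requests is just the position argued above, not something the standard itself spells out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str,
              user_initiated: bool = False) -> bool:
    """Decide whether to fetch a URL.

    Automated link traversal consults robots.txt. A request made on
    behalf of a specific user skips the check; that is a policy choice
    reflecting the argument above, not a rule from the standard.
    """
    if user_initiated:
        return True
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    agent = "example-crawler"  # hypothetical user agent name
    for url in ("http://example.com/public/page",
                "http://example.com/private/page"):
        print(url, may_fetch(parser, agent, url))
```

A crawler following links on its own would call `may_fetch` with the default `user_initiated=False`; a tool fetching one page because a person asked for it would pass `user_initiated=True`.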
From there, the waters just become really murky. Is lynx a scraper because it doesn't render the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards? Maybe it is the storage of scraped data that is the issue? What about caches? I could go on, but I'm sure you see what I'm getting at. It's a very complex issue that is not at all well understood.
[1] http://www.craigslist.org/robots.txt
[2] http://www.robotstxt.org/robotstxt.html
[3] http://support.google.com/webmasters/bin/answer.py?hl=en&...