by rizzom5000 4940 days ago
Good points, but...

1. I do think it addresses search engines because site operators do explicitly give search engines permission to scrape their sites via something called "robots.txt" files, otherwise known as the "robots exclusion standard".

2. Like all other scenarios, this one is also likely between you and the site operator. Are you breaking the site's TOU? The answer to that question might help. If you are asking me for the answer to a moral dilemma, I might suggest that you try Shakespeare for some relevant insight to your question(s).

3. See (2). I believe you are incorrect in your last sentence, and in a number of ways, but feel free to disagree.

On #1, you mentioned before that Craigslist disallows scraping, yet unless you are OmniExplorer, scraping seems to be mostly fair game if its robots.txt means anything [1]. The robots.txt standard mentions nothing about it being for search engines [2], so there are no special exclusions for search engines specifically.
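To make the per-agent point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text below is a made-up sample, not the actual Craigslist file, and the `MyScraper` agent name and `example.org` URLs are placeholders of mine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the real Craigslist file: one named bot is
# banned outright, every other agent is only kept out of /reply.
SAMPLE_ROBOTS_TXT = """\
User-agent: OmniExplorer_Bot
Disallow: /

User-agent: *
Disallow: /reply
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS_TXT)

# Rules are keyed on the User-agent name, not on "search engine or not".
print(rules.can_fetch("OmniExplorer_Bot", "http://example.org/search"))
print(rules.can_fetch("MyScraper", "http://example.org/search"))
print(rules.can_fetch("MyScraper", "http://example.org/reply"))
```

With this sample the three checks should come out as False, True, False: the named bot is shut out everywhere, while any other agent, search engine or not, gets the same treatment as every other.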

Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated [3].
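As a rough sketch of that distinction (my own reading of it, not anything the standard or Google spells out in code), a client could consult robots.txt only when it is traversing links on its own. The `may_request` helper, the sample rules, and the `MyScraper` agent name are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration: every agent is kept out of /reply.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /reply"])


def may_request(url: str, agent: str, rules: RobotFileParser,
                user_initiated: bool) -> bool:
    """Policy sketch: only automated link traversal consults robots.txt."""
    if user_initiated:
        # A person asked for this exact URL, so treat it like a browser visit.
        return True
    # The program picked this URL itself while crawling, so honor the rules.
    return rules.can_fetch(agent, url)


print(may_request("http://example.org/reply", "MyScraper", rules, user_initiated=True))
print(may_request("http://example.org/reply", "MyScraper", rules, user_initiated=False))
```

Whether a site operator would accept that line between "user asked" and "crawler decided" is exactly the murky part.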

From there, the waters just become really murky. Is lynx a scraper because it doesn't render the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards? Maybe it is the storage of scraped data that is the issue? What about caches? I could go on, but I'm sure you see what I'm getting at. It's a very complex issue that is not at all well understood.

[1] http://www.craigslist.org/robots.txt

[2] http://www.robotstxt.org/robotstxt.html

[3] http://support.google.com/webmasters/bin/answer.py?hl=en&...

Good point. I meant to include a mention of "robots.txt" but I forgot, or deleted it while editing. That's the motivation for number 3. A "robots.txt is the law" philosophy makes some sense to me, but number 3 is an example of a time when I think it falls down. I don't see a meaningful distinction between scripting my daily bookmark visits and doing them manually. What about extensive browser plugins?

This certainly isn't settled legally, and it doesn't seem to be settled ethically either, considering the various insane statements politicians make when they comment on the subject.

Some examples of the specific concepts from a thousand years ago that apply and answer these questions would help me see what you see. I know the basic rules for music sampling, for referencing other works when writing, where the line for plagiarism is drawn, and the rights for using photography. I don't know the rules for accessing network resources that are open, or for using their data.

If you don't want your data used by others, don't send it to them.

You explicitly give them permission to have it by going out of your way to install a program on a common port, with a common API, and giving it a directory full of documents to distribute, and not using any form of authentication. The way the web works is that answering is equivalent to granting permission to ask and sending a file is tantamount to granting permission. When you receive a file you don't first receive a permissions document, you receive the file - authentication and contractual obligations come first because there is no later. (This is like the tide, you may not like it but that doesn't mean you can change it, especially not with laws.)

You have many ways to check authentication, and legally they can be VERY weak (1-bit passwords are sufficient), but if you don't restrict access it is open - not just because it's the default, but because it's the technical reality: they didn't hack into your computer to get that file, they asked your document server and it gave it to them!
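A toy sketch of what "restricting access" means on the server side, under my own assumptions: `handle_request`, the `DOCUMENTS` table, and the bare-token `Authorization` check are all made up for illustration and are not how any particular web server does it.

```python
import hmac

# Hypothetical document root: path -> file contents.
DOCUMENTS = {"/index.html": "hello"}


def handle_request(path, headers, required_token=None):
    """Return (status, body) for a request.

    With required_token=None the server is open: anyone who asks gets the file.
    """
    if required_token is not None:
        supplied = headers.get("Authorization", "")
        # Even a very weak shared secret separates "open" from "restricted".
        if not hmac.compare_digest(supplied.encode(), required_token.encode()):
            return 401, ""
    body = DOCUMENTS.get(path)
    if body is None:
        return 404, ""
    return 200, body


# Open server: asking is enough.
print(handle_request("/index.html", {}))
# Restricted server: the same request is refused without the token.
print(handle_request("/index.html", {}, required_token="1"))
print(handle_request("/index.html", {"Authorization": "1"}, required_token="1"))
```

The only difference between the first two calls is a decision the operator made; the requester did the same thing both times.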

Robots.txt is a suggestion, for the scraper's benefit! It suggests better links. You're allowed to see the rest (the server sends them to you without a password) but you're unlikely to find good content.

If you're afraid of someone examining data you send them, don't send them the data when they ask. Expecting them not to ask, or, once they've received it, not to manipulate it in certain ways because you can't then extract a fee for them doing so, is controlling and, moreover, doomed to fail.