by rizzom5000 4980 days ago

Right, and this is why sites like Craigslist explicitly forbid scraping. If the site operators wanted, explicitly, to share their data with you, they would provide an API or give you permission to scrape.

The reality of scraping was well understood many years ago. If you're doing it for above-board reasons such as research, you'll probably get a pass; if you're doing it to profit from someone else's work because you are too lazy to do the work yourself, it's probably unethical and you won't get a pass. These concepts have been around for at least a thousand years.

Full Disclosure: I have also scraped data - but only from government websites where the scraped data is explicitly public domain to begin with and APIs were not available.

freshhawk 4980 days ago

1. That doesn't address search engines, which are doing it to profit from someone else's work. If you open the door for search engines, then how many search-engine-like things do you give passes to?

2. What if I'm scraping it just for me, because I want a different interface? How many friends can I share that with? Can I open source the program?

3. What if I read a bunch of these sites to do research and write up a story about it? Not plagiarizing, just summarizing and providing analysis of Craigslist rental prices? What if I do this every day? What if I automate that process? The data is transformed just as much as if I had read it and crunched the numbers myself, and I made just as many requests to the site as my browser would have.

Concepts that have been around a thousand years or more are not fully applicable. Like the printing press, some things alter the scarcity equation for ideas and data distribution and ownership. Considering how little we've agreed on about print after 500 years I have some doubts that this is as closed an issue as you say.

jswanson 4980 days ago

Search engines:

- Respect robots.txt (as mentioned elsewhere) which will often provide a limited subset of all data available

- Give something in return (potential traffic) for the data they reap.
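As a rough illustration of the first point, here is a minimal sketch of how a crawler might consult robots.txt before fetching anything, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is crawlable except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def crawlable(urls, user_agent, robots_txt=ROBOTS_TXT):
    """Return only the URLs that robots.txt permits this user agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parsed in memory, no network access
    return [url for url in urls if parser.can_fetch(user_agent, url)]

# Hypothetical candidate URLs a crawler has discovered.
candidates = [
    "http://example.com/listings/1",
    "http://example.com/private/owner-email",
]
print(crawlable(candidates, "ExampleBot"))
```

The filter is where the "limited subset" comes from: a well-behaved crawler simply never requests the disallowed paths.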

I fully agree that scraping is great, and do it myself frequently. Site operators do have legitimate concerns in some situations though, and it probably comes from feeling as if they are being 'ripped off' somehow.

No one in their right mind is going to object to incidental scraping for personal use.

However, scraping is often scripted into cron or the like and that data is then used to profit someone else. I'm usually cool with that, but if someone is running a web site and they are dependent upon ad revenue to keep the servers running, I understand objecting to it.

freshhawk 4980 days ago

Good rules of thumb.

> No one in their right mind is going to object to incidental scraping for personal use.

It would almost certainly involve stripping ads when re-purposing the content.

rizzom5000 4980 days ago

Good points, but...

1. I do think it addresses search engines, because site operators explicitly give search engines permission to scrape their sites via "robots.txt" files, otherwise known as the "robots exclusion standard".

2. Like all other scenarios, this one is also likely between you and the site operator. Are you breaking the site's TOU? The answer to that question might help. If you are asking me for the answer to a moral dilemma, I might suggest that you try Shakespeare for some relevant insight into your question(s).

3. See (2). I believe you are incorrect in your last sentence, and in a number of ways, but feel free to disagree.

randomdata 4980 days ago

On #1, you mentioned before that Craigslist disallows scraping, yet unless you are OmniExplorer, it seems scraping is mostly fair game if robots.txt means anything [1]. The robots.txt standard mentions nothing about it being for search engines [2], so there are no special exclusions for search engines specifically.
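To make the per-agent point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt text below is invented for illustration and is not the actual contents of [1]; the paths, the example.org host, and the `SomeOtherBot` name are likewise assumptions. It only shows that rules are grouped by user agent, with no category reserved for search engines.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt, NOT the real contents of [1]: one named crawler is
# banned outright, everyone else is only kept out of /reply.
ROBOTS_TXT = """\
User-agent: OmniExplorer
Disallow: /

User-agent: *
Disallow: /reply
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(user_agent, path):
    """Check a path on a hypothetical host against the rules above."""
    return parser.can_fetch(user_agent, "http://example.org" + path)

for agent in ("OmniExplorer", "SomeOtherBot"):
    print(agent, may_fetch(agent, "/search/apa"), may_fetch(agent, "/reply"))
```

Under these assumed rules, any agent that is not named falls through to the `*` group, whether it is a search engine or a personal script.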

Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated [3].

From there, the waters just become really murky. Is lynx a scraper because it doesn't render the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards? Maybe it is the storage of scraped data that is the issue? What about caches? I could go on, but I'm sure you see what I'm getting at. It's a very complex issue that is not at all understood.

[1] http://www.craigslist.org/robots.txt

[2] http://www.robotstxt.org/robotstxt.html

[3] http://support.google.com/webmasters/bin/answer.py?hl=en&...

freshhawk 4980 days ago

Good point. I meant to include a mention of "robots.txt" but I forgot, or deleted it while editing. That's the motivation for number 3. A "robots.txt is the law" philosophy makes some sense to me, but number 3 is an example of a time when I think it falls down. I don't see a meaningful distinction between scripting my daily bookmark visits and doing them manually. What about extensive browser plugins?

This certainly isn't settled legally, and it doesn't seem to be settled ethically either, considering the various insane statements that occur when politicians comment on the subject.

Some examples of the specific concepts from a thousand years ago that apply and answer these questions would help me see what you see. I know the basic rules for music sampling, for referencing other works when writing, where the line for plagiarism is drawn, and the rights for using photography. I don't know the rules for accessing network resources that are open, or for using their data.

wnight 4980 days ago

If you don't want your data used by others, don't send it to them.

You explicitly give them permission to have it by going out of your way to install a program on a common port, with a common API, and giving it a directory full of documents to distribute, and not using any form of authentication. The way the web works is that answering is equivalent to granting permission to ask and sending a file is tantamount to granting permission. When you receive a file you don't first receive a permissions document, you receive the file - authentication and contractual obligations come first because there is no later. (This is like the tide, you may not like it but that doesn't mean you can change it, especially not with laws.)

You have many ways to check authentication and legally they can be VERY weak, 1-bit passwords are sufficient, but if you don't restrict access it is open - not just because it's the default, but because it's the technical reality: they didn't hack into your computer to get that file, they asked your document server and it gave it to them!
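A toy sketch of that distinction, under loud assumptions: `respond`, `DOCUMENTS`, and the bearer-token scheme are all hypothetical stand-ins rather than any real server's API. With no token configured the document is handed to whoever asks; with even a trivial token configured, the same request is refused.

```python
import hmac

# Hypothetical document store standing in for "a directory full of documents".
DOCUMENTS = {"/listing.html": "<p>2BR apartment</p>"}

def respond(path, headers, required_token=None):
    """Return (status, body) for a request, loosely modeled on HTTP status codes."""
    if path not in DOCUMENTS:
        return 404, "not found"
    if required_token is not None:
        supplied = headers.get("Authorization", "")
        expected = "Bearer " + required_token
        # Constant-time comparison avoids leaking the token through timing.
        if not hmac.compare_digest(supplied, expected):
            return 401, "authentication required"
    return 200, DOCUMENTS[path]

# Open server: anyone who asks gets the file.
print(respond("/listing.html", {}))
# Restricted server: the same anonymous request is refused.
print(respond("/listing.html", {}, required_token="s3cret"))
# Restricted server: a request carrying the token is served.
print(respond("/listing.html", {"Authorization": "Bearer s3cret"}, required_token="s3cret"))
```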

Robots.txt is a suggestion, for the scraper's benefit! It suggests better links. You're allowed to see the rest (the server sends them to you without a password) but you're unlikely to find good content.

If you're afraid of someone examining data you send them, don't send them the data if they ask. Expecting them not to ask, or, once they've received it, not to manipulate it in certain ways because you can't then extract a fee for their doing so, is controlling and, moreover, doomed to fail.
