by gwu78 4655 days ago
Kudos to the WP for ongoing coverage of this case. There are important issues being litigated here that could affect everyone, and I'd argue they are worth discussing without regard to this particular defendant and the sheer stupidity of his actions.

However, I find WP's use of Poulson's activities as an example of "legitimate" automated HTML retrieval ("scraping") to be an odd one. It seems an awkward comparison for conveying what should be a simple point, in my opinion.

How about something much more common? Googlebot. Imagine if we forbade Google from using automation and from scraping content and placing it in the Google cache. No more web search.

Alas, because of the ad hoc nature of the Web (i.e., there is no unifying organizational scheme for locating content across all websites, as there would be for locating content in, say, a library of books), you cannot access Web content until you first discover it. In order to discover content, you generally have to search. In order to create an index and cache of content to search, someone has to scan/crawl/scrape websites. The latter three are activities that are routinely automated. As such, they will violate many websites' Terms of Service and may get you banned simply for being "automated".

In fact, to use Google as an example (not picking on them per se, it's just that they are a well-known example), crawling Google will "get you banned" from using Google, temporarily.

The irony of this has always intrigued me: Google may crawl your servers, but under Google's policies, you may not crawl Google's servers.

If I create an index of your website, at your expense (by aggressively running automated queries against your http server, as Google does, for example), am I obligated to share it with you?

In any event, attempts to criminalize automation should raise red flags with anyone who is even slightly tech savvy.

1 comment

>> The irony of this has always intrigued me: Google may crawl your servers, but under Google's policies, you may not crawl Google's servers.

It looks like some of their site can be crawled and some cannot. That's how robots.txt has worked for a long time:

http://www.google.com/robots.txt

And search results (the data they have obtained via crawling others' sites) are not among the data that can be crawled.
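
For anyone curious how a well-behaved crawler would check this, here is a minimal sketch using Python's standard urllib.robotparser module. The rules below are made up for illustration and are not a copy of Google's actual robots.txt; the "ExampleBot" user agent and the example.com URLs are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT Google's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A search-results URL falls under the Disallow rule in the sample rules.
print(may_fetch(parser, "ExampleBot", "http://www.example.com/search?q=test"))

# A page under /about is explicitly allowed in the sample rules.
print(may_fetch(parser, "ExampleBot", "http://www.example.com/about"))
```

Against a live site, a crawler would typically call set_url() and read() on the parser instead of parse(). Either way, robots.txt is only a request to crawlers; nothing in it technically enforces anything.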

What are you suggesting?