Hacker News
by cookiecaper 4544 days ago
Pretty much how you have to do it. Virtually no one is going to agree to allow you to scrape their stuff.

It's just like when you're discussing your entrepreneurial plans with a non-entrepreneur. Typically, their response is an eye roll and a sarcastic "Good luck!". Most people envy successful entrepreneurs really hard, but they have no patience for beginners. It's like they don't understand that there's stuff in between being a broke college kid and having a record-breaking tech IPO. The same is true for copyright infringers and web scrapers; demonstrate your value first, and then open discussions.

If Google had strictly abided by copyright or computer access laws, it would never have existed, because it would have had to ask each website owner for formal permission to crawl their pages and store records of their copyrighted material, including derivative works created by compression, indexing, etc. (and, worse, for the right to reproduce this content for display in search results), and to meticulously record each copyright holder's assent. They'd have had to verify that the legitimate copyright owner had provided assent, and not an impostor. They wouldn't have been able to access anything, because almost all boilerplate ToS documents forbid "automated access" and the like, and violating the ToS is illegal computer access.

Google is operating in a highly illegal fashion, and the only reason they're allowed to exist is because they demonstrated their value first by driving traffic to the copyright holders' websites. PayPal did the same thing with banking and payment regulations. At some point, you just have to recognize that most people aren't going to get it until you show them, and don't want to be bothered with your delusions of grandeur. It's better to ask forgiveness than permission.

2 comments

For an even starker example, look at the Google Books case. They scanned thousands of books and made portions of them available online. The Authors Guild sued them, and the case was dismissed because it benefited society to have the works available.

I find the Google Books ruling particularly insane... they have undeniably been copying books illegally and for a profit motive (even if they don't yet make money directly off the copies), but this is okay because they've provided a valuable service.

What?!

If it's such a valuable service they could have given a cut to the authors instead of illegally copying their books. Allowing somebody to break the law because you believe they are morally good is textbook legislating from the bench and it's wrong.

Not only that, but now Google has an exclusive license deal to do this, and nobody else can do it.

No, there was a proposed settlement with the Authors Guild that sort of but not really was an exclusive deal with Google. But that settlement was not approved.

That proposed deal did apply only to Google, but it didn't preclude anybody else from negotiating their own deal.

> If Google had strictly abided by copyright or computer access laws, it would never have existed

Whoa there. At the time, the consensus was that anything put up on web servers was there for anyone to see by any means. Minus the 'robots.txt' convention for automated scanning, which existed because there were many other search engines before Google.

Was that really the consensus among the legal community? Somehow I seriously doubt that anyone familiar with copyright law would assert that your copyright was invalidated by publishing on the web. A weak argument for fair use could be made for crawling the text portions, because only a small (often "insignificant") portion of the work is reproduced in human-readable form, but that argument certainly does not apply to crawling images.

Is robots.txt a legally-admissible copyright release? There's probably more room to debate that one, but it's not clear-cut. What does it cover? Is it applicable to all crawlers? Can you do a general license release like that without your work effectively becoming public domain? What's the difference between a crawler and a human reader subject to standard copyright terms? What licenses are implicitly granted in a typical robots.txt? It's not like robots.txt is a verbose document that lays all of this out, and all of it is a potential legal problem point.

Also, Google assumes permission by default and only doesn't scan if you explicitly DISALLOW it with robots.txt. This is the opposite of copyright, which reserves a monopoly to the rightsholder unless he explicitly ALLOWS a certain use. It's undeniable that Google is violating copyright law millions of times each and every day, and that said violation is fundamental to their business.
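To make the opt-out default concrete, here is a minimal sketch using the robots.txt parser in Python's standard library. It is not Google's crawler, just one common implementation of the same convention; the bot name `ExampleBot`, the `example.com` URLs, and the helper `make_parser` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical site whose robots.txt is empty: no rules at all.
empty = make_parser("")

# Hypothetical site that opts one directory out for every crawler.
opt_out = make_parser("User-agent: *\nDisallow: /private/\n")

# With no rules, the parser treats everything as fetchable.
print(empty.can_fetch("ExampleBot", "http://example.com/anything"))

# Paths not named in a Disallow line stay fetchable.
print(opt_out.can_fetch("ExampleBot", "http://example.com/public/page"))

# Only the explicitly disallowed path is refused.
print(opt_out.can_fetch("ExampleBot", "http://example.com/private/page"))
```

Silence is treated as permission here: the only way to get a refusal is for the site owner to write a Disallow rule, which is the inversion of the copyright default described above.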

And does any of this negate the computer access laws that make a site's ToS legally binding, even to those who don't formally agree to them? Strictly interpreted, Google would still be behaving illegally even if the copyright element was taken away.

I agree that there were search engines before Google, and that they mostly were in the same problematic legal situation.

And then there's the Internet Archive. IANAL but it seems as if there's effectively this assumption that opt-out makes everything OK even if there's not much if any legal basis to it. (I wrote this piece back in 2005 and not a whole lot has changed: http://bitmason.blogspot.com/2005/07/thoughts-on-wayback-mac...)

To be clear, this is arguably the way that the Web almost has to work--but that doesn't make it all neatly legal under current copyright law.