by ChuckMcM 3261 days ago
As someone who once oversaw the operation of a web crawler, I can tell you it's pretty simple: if it is "okay", then the robots.txt file will tell you it's allowed. If you look at the LinkedIn robots.txt (https://linkedin.com/robots.txt) you will see it is carefully groomed to allow various search engines to look through specific sections of their web site; the rest are disallowed.
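
For anyone who hasn't written a crawler, here is a rough sketch of that check using Python's standard `urllib.robotparser`. The robots.txt text, the `example.com` URLs, the `MyCrawler` agent name and the `is_allowed` helper are all made up for illustration; this is not LinkedIn's actual file, just the same general shape (one named search engine gets a section, everyone else gets nothing).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT LinkedIn's real one: a named crawler may
# read /pub/ only, and every other crawler is disallowed everywhere.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /pub/
Disallow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Googlebot", "https://example.com/pub/alice"),
        ("Googlebot", "https://example.com/private"),
        ("MyCrawler", "https://example.com/pub/alice"),
    ]:
        print(agent, url, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

A real crawler would fetch the live file first (for example with `set_url()` and `read()` on the same parser) and run this check before every request.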

Pretty much all of the case law comes down to this: there is a perfectly valid copyright on the 'collection' of a web site regardless of ownership of particular pieces, and the robots.txt is a well-known and well-understood mechanism for communicating 'authorization'.

There is a "value" to LinkedIn in letting Google and other search engines crawl them: pages pointing at LinkedIn show up in your search results, so LinkedIn lets them crawl their pages.

At the end of the day this is exactly a question of value. Microsoft knows that the collection of information in LinkedIn is valuable for a number of uses. If you want to pay them some of that value to get access to it, fine; if not, then don't use it.

Here is one possible outcome: Microsoft will tell them what it will cost to use their info; HiQ will probably not be able to meet it because they've built their existing pricing structure around "free" access; and then, as they are going down the drain, Microsoft will buy their assets and technology, and LinkedIn will get this new service you can buy from them to help you find and retain people.


From what I've been told, if the data is factual, such as current employment information, then it doesn't fall under copyright.

Interpretation of that factual data would fall under copyright though.

Kind of, kind of not. It is true that the language of Copyright law calls out 'facts' as something that is not protected by copyright, but how you get the facts has a large bearing on whether or not you can reproduce them.

There is a lot of case law around this stuff, as you might imagine. I certainly haven't followed all of it, but my interest in information economics has led me to read fairly extensively about it. And I'm not a lawyer, and especially not a Copyright lawyer, so it is entirely possible that everything I have come to know is pure bollocks; consider yourself so warned :-).

Generally, in reading about these things, two questions come out: the 'facts' and 'how you got access to them'. There are lots of cases where the "collection" of facts has been upheld to be protected. So for example the "Machinists Handbook" is a collection of facts about machining and the handbook is protected by copyright, even though the specific dimensions of various thread pitches are just 'facts'. Perhaps more interesting have been cases involving national sports leagues against companies and fans who do things like "live tweet" a sports event. They have argued successfully that by buying a ticket to the event you have agreed to the terms of that admission, which expressly prohibit you from reproducing those facts in any form. So while it may be a "fact" that Buster Posey just struck out, if you learned of that fact by sitting in AT&T Park at a game you can't legally "tweet" it without violating the agreement with Major League Baseball that you accepted when you bought the ticket.

It has similarly been held (look at a lot of Craigslist vs a bunch of people) that automated access to a web site through scraping is something you have to be explicitly allowed to do. That allowance comes from the terms of service of the web site and is expressed by the robots.txt file (and the available terms of service contracts on the site).

What it boils down to is that the collection of facts in a web site IS protected by Copyright. Further, in exchange for granting you access to the information, the Copyright owner CAN put restrictions on how you may further use the facts you discover there. If you wish to use the information in a way the Copyright owner objects to, you must get the 'facts' through some other source and not the Copyright owner's collection.

And yes, getting it out of Google's cache of the pages does not count as another source. See the Craigslist vs 3Taps (https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.) dispute to get a feel for how the court views things. The simplest interpretation I can make from those events is that Google's caching of pages counts as fair use (it makes results faster), but people taking the page from Google's cache is either a CFAA or Copyright violation and thus disallowed.