
Kind of, kind of not. It is true that the language of Copyright law calls out 'facts' as something that is not protected by copyright, but how you get the facts has a large bearing on whether or not you can reproduce them.

There is a lot of case law around this stuff, as you might imagine. I certainly haven't followed all of it, but my interest in information economics has led me to read fairly extensively about it. And I'm not a lawyer, and especially not a Copyright lawyer, so it is entirely possible that everything I have come to know is pure bollocks. Consider yourself so warned :-).

Generally, in reading about these things, two questions come out: what the 'facts' are, and 'how you got access to them'. There are lots of cases where the "collection" of facts has been upheld to be protected. So for example the "Machinery's Handbook" is a collection of facts about machining, and the handbook is protected by copyright, even though the specific dimensions of various thread pitches are just 'facts'. Perhaps more interesting have been cases involving national sports leagues against companies and fans who do things like "live tweet" a sports event. They have argued successfully that by buying a ticket to the event you have agreed to the terms of that admission, which expressly prohibit you from reproducing those facts in any form. So while it may be a "fact" that Buster Posey just struck out, if you learned of that fact by sitting in AT&T Park at a game, you can't legally "tweet" it without violating the agreement with Major League Baseball that you accepted when you bought the ticket.

It has similarly been held (look at a lot of Craigslist vs a bunch of people) that automated access to a web site through scraping is a kind of access that you have to be explicitly allowed. That allowance comes from the terms of service of the web site and is expressed by the robots.txt file (and the available terms of service contracts on the site).
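For the robots.txt part, here is a minimal sketch of how a scraper could check what a site's robots.txt says before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `example.org` URLs are all made up for illustration, and `may_fetch` is a hypothetical helper name. Passing this check only tells you what robots.txt says; it says nothing about the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/search/cars", "https://example.org/about"):
        print(url, may_fetch(SAMPLE_ROBOTS_TXT, "example-bot", url))
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of feeding it text directly.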

What it boils down to is that the collection of facts in a web site IS protected by Copyright. Further, in exchange for granting you access to the information, the Copyright owner CAN put restrictions on how you may further use the facts you discover there. If you wish to use the information in a way the Copyright owner objects to, you must get the 'facts' through some other source and not the Copyright owner's collection.

And yes, getting it out of Google's cache of the pages does not count as some other source. See the Craigslist vs 3Taps (https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.) dispute to get a feel for how the court views things. The simplest interpretation I can make from those events is that Google's caching of pages counts as fair use (it makes results faster), but people taking the page from Google's cache is either a CFAA or Copyright violation and thus disallowed.