| Common Crawl contains the HTML? I wonder how this is legal and considered acceptable. I wish I knew how even Google and others get away with scraping content, saving it, and utilizing it for profit without sharing any revenue with the original webmasters. I know people can opt out of crawling, at least with the crawlers that actually respect that. But still, am I the only one who feels like this is wrong? I guess I have this view that your domain is yours, and you invite the public in like an open house. It's my house, my property, and the door is open, so people can come in and look around at my stuff. But the expectation is that only locals will arrive, in small numbers, and they'll be good guests. If someone is breaking the lock on the bedroom door and going through the private drawers, that's wrong. If someone is taking photographs of everything to then create a virtual tour of my house that they charge for, that's wrong. The expectation is that you're being nice by providing free and open access to information you created and own, and people should behave courteously in return. Then, if you as the webmaster choose, you can provide an API or database dumps for people to download, along with the licensing terms. That is when it feels right for people to do things like this with the data, because you intentionally provided it through a non-personal interface. To me the web is still a personal interface. I expect humans to use it in an ordinary, human-like way, where it is somewhat ephemeral and courteous. I feel like Google cheated their way to success, and Common Crawl is stealing to raise its position in a similarly unfair manner. These all seem like parasites to me. They didn't create anything, they just steal it en masse. There are so many businesses like that, such as Domain Tools, which gets rich by hoarding everyone's contact details from WHOIS:
http://whois.domaintools.com/commoncrawl.org They have a screenshot history they won't ever delete even if you ask nicely. Here is a picture of Common Crawl from 2011:
http://thumbnails.domaintools.com/domaintools/2016-01-08T19:... |
Well, here's the answer: "transformative" reuse of content is explicitly permitted under copyright law. Simply reproducing the content and charging for it would not fall under this provision, but building an archive of publicly available information is, quite appropriately, permissible.
There was recently a very large court case regarding this principle and its application to Google Books. Google won by demonstrating that its search index is not equivalent to the original works and does not affect the market for them, which makes it a "transformative" use.
Sharing is good. Publicly available works achieve their aims only by being consumed by others - anyone who publishes a work free of charge should expect it to be, and remain, publicly accessible.