Hacker News
by mei0Iesh 3818 days ago
Common Crawl contains the HTML?

I wonder how this is legal and considered acceptable. I wish I knew how even Google and others get away with scraping content, saving it, and utilizing it for profit without sharing any revenue with the original webmasters.

I know people can opt out of crawling, for those that actually respect that. But still, am I the only one who feels like this is wrong?

I guess I have this view that your domain is yours, and you invite the public in like an open house. It's my house, my property, and the door is open, where people can come in and look around at my stuff. But the expectation is that only locals will arrive, in small numbers, and they'll be good guests. If someone is breaking the lock on the bedroom door and going through the private drawers, that's wrong. If someone is taking photographs of everything, to then create a virtual tour of my house they charge for, that's wrong. The expectation is you're being nice by providing free and open access to information you created and own, and people should behave courteously in return.

Then if you as the webmaster choose, you can provide an API, or database dumps for people to download, along with the licensing terms. That is when it feels right for people to do things like this with the data, because you intentionally provided it through a non-personal interface.

To me the web is still a personal interface. I expect humans to use it, in an ordinary human-like way where it is somewhat ephemeral and courteous. I feel like Google cheated their way to success, and Common Crawl is stealing to raise its position in a similarly unfair manner.

These all seem like parasites to me. They didn't create anything, they just steal it en masse.

There are so many businesses like that, such as Domain Tools, which gets rich by hoarding everyone's contact details from WHOIS: http://whois.domaintools.com/commoncrawl.org

They have a screenshot history they won't ever delete even if you ask nicely. Here is a picture of Common Crawl from 2011: http://thumbnails.domaintools.com/domaintools/2016-01-08T19:...

2 comments

> I wonder how this is legal and considered acceptable. I wish I knew how even Google and others get away with scraping content [...]

Well, here's the answer: "transformative" reuse of content is explicitly permitted under copyright law. Simply reproducing the content and charging for it would not fall under this provision, but building an archive of publicly available information is, quite appropriately, permissible.

There was recently a very large court case regarding this principle and its application to Google Books. Google won by demonstrating that their search index is not equivalent to the original work and does not affect the market for it, which makes it a "transformative" use.

Sharing is good. Publicly available works achieve their aims only by being consumed by others - anyone who publishes a work free of charge should expect it to be, and remain, publicly accessible.

I believe the Internet Archive is much less clear in this regard than public search engines. IA doesn't even have a clear takedown policy, and it has no webmaster tools to give owners control over the archived content. Their crawler does obey robots.txt rules, but if you want content removed permanently you have to ask politely by email. In my experience they simply block the site's URLs from being searched, and they don't make clear at all whether the content was actually removed from their servers.

I think the idea of intentionally deleting content is pretty foreign to the Internet Archive. They're more likely to say "welcome to oblivion!" and set a timer for 70 years to show the content again.

I don't think "Sharing is good" is true in the real world. If you apply that as a blanket statement, you'll end up in trouble.

What is legal is not always ethical. I think there's an interesting story there about how Google is legal, if someone doesn't automatically assume it should be just because it is.

The text online isn't always similar to a published text of the past. There is a personal overlap today that changes the rules. Such as this text I'm publishing right now. Forgetting about all the legalities and technicalities, I still feel like it is different than a page published in a book. I still feel like I should have the power to edit or delete it whenever I want in the future, yet Hacker News disagrees and removes my right to modify it, forever capturing it as if it owns it, not me. I still feel like this text is more transitory, where its relevance is mostly right now, and if it were deleted in a month it would be fine, because it's mostly just chit chat.

Certainly we could live in a world where everyone has microphones transcribing everything they ever say, which is transmitted to Google, and provided to researchers, where all kinds of uses could emerge. But that's a different world than the one where we've developed rules for today. Right now, I feel like most things I say are in passing, and should not only disappear but also not spread to where someone is capturing and propagating them beyond my control.

What control do I have over my text that is in this Common Crawl database? What if it captured information that was considered to be ephemeral in the website's context, and ripped it out of its home where it's now part of this collective publication, where anyone can use it for anything?

Sharing could be good in a world where people are not selfish and malicious. But in this one, many people will use whatever data they can get their hands on for selfish and malicious purposes that do not benefit you, the author, in any way. I bet a large percentage of use for that Common Crawl database was harmful to society, such as for helping spammers generate fake content.

> "These all seem like parasites to me."

Your impression is wrong. Search engines and other services based on web data provide great value to society. They don't create the documents they link to, but they deliver relevant links in response to people's queries. That's a great service. Without search engines, people might never find the web page at all. That's why a large portion of website owners and webmasters are glad search engine crawlers visit them, and even expect indexing to be fast and smooth.

If you publish anything on the Web, you're facilitating free use and duplication of it across the whole world. If this was not your intention, but you still published your stuff on your website, you misunderstood the original intent and reality of the Web as a medium for sharing information.

There is a widely known standard for communication between robots and websites called the robots.txt standard. It is a file where you can state your intent to restrict crawler downloads. There is also the HTML tag <meta name="robots" content="noindex,nofollow">, which signals to crawlers your wish that the page should not appear in search engine results. If you want to prevent people from accessing and using your documents, use these. Both Google and Common Crawl seem to obey them. If you want to _make_sure_ nobody accesses and uses your documents, don't publish them on the Web.
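
For anyone curious how a well-behaved crawler might honor both signals, here is a rough sketch using only Python's standard library. The robots.txt contents, the example.com URLs, and the helper names (allowed, meta_directives) are made up for illustration. CCBot is, to my knowledge, the user-agent token Common Crawl's crawler uses, but treat that as an assumption and check their documentation. Real crawlers will differ in the details.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other crawler out of /private/.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def meta_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(allowed("CCBot", "https://example.com/index.html"))
    print(allowed("ExampleBrowser", "https://example.com/private/notes.html"))
    print(meta_directives('<meta name="robots" content="noindex,nofollow">'))
```

Note that both mechanisms are requests, not enforcement: nothing in this code stops a crawler that simply skips the check.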

There is no practical way to ensure your documents are accessible only for the limited period you want. If you release them to the world, you always lose control over their distribution and use.

A lot of this kind of thing in the "real world" is managed by social convention. People understand that there is a difference based on context that cannot be fully captured by the law. For example, it may be perfectly legal to take photos of strangers at the beach, but we all understand why that is creepy.

The thing is that on the world wide web the social convention is strongly in favour of being able to slurp up data, at least as long as it does not cause technical problems. Mostly people get this and understand it.

Other apps have emerged that follow different social conventions. For example, if you share something on Snapchat you are suggesting that the information should be ephemeral. But you can't expect people/crawlers to infer the context without having that strong hint.