|
I don't think "Sharing is good" holds in the real world. Apply it as a blanket statement and you'll end up in trouble. What is legal is not always ethical. I think there's an interesting story in how what Google does came to be legal, if you don't automatically assume it should be legal just because it is. Text online isn't always like the published text of the past. There is a personal element today that changes the rules. Take this text I'm publishing right now. Forgetting about all the legalities and technicalities, I still feel it is different from a page published in a book. I still feel I should have the power to edit or delete it whenever I want in the future, yet Hacker News disagrees and removes my right to modify it, capturing it forever as if Hacker News owned it, not me. I still feel this text is more transitory: its relevance is mostly right now, and if it were deleted in a month that would be fine, because it's mostly just chitchat. Certainly we could live in a world where everyone carries microphones transcribing everything they say, with the transcripts transmitted to Google and provided to researchers, and all kinds of uses could emerge. But that's a different world from the one our current rules were developed for. Right now, I feel most things I say are said in passing, and not only should they disappear, they shouldn't spread to places where someone captures and propagates them beyond my control. What control do I have over my text that is in this Common Crawl database? What if it captured information that was meant to be ephemeral in the website's context and ripped it out of its home, so that it's now part of this collective publication where anyone can use it for anything? Sharing could be good in a world where people are not selfish and malicious. But in this one, many people will use whatever data they can get their hands on for selfish and malicious purposes that do not benefit you, the author, in any way.
I bet a large percentage of the use of that Common Crawl database is harmful to society, such as helping spammers generate fake content. |
Your impression is wrong. Search engines and other services based on web data provide great value to society. They don't create the documents they link to, but they deliver relevant links in response to people's queries. That's a great service. Without search engines, people might never find the web page at all. That's why a large portion of website owners and webmasters are glad when search engine crawlers visit them, and even expect indexing to be fast and smooth.
If you publish anything on your website, you're facilitating its free use and duplication across the whole world. If that was not your intention, but you published your stuff on the Web anyway, you misunderstood both the original intent and the reality of the Web as a medium for sharing information.
There is a widely known standard for communication between robots and websites called the robots.txt standard. It is a file where you can state your intent to restrict crawler downloads. There is also the HTML tag <meta name="robots" content="noindex,nofollow">, which signals to crawlers your wish that the page should not appear in search engine results. If you want to keep crawlers from collecting and indexing your documents, use these. Both Google and Common Crawl seem to obey them. If you want to _make_sure_ nobody accesses and uses your documents, don't publish them on the Web.
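To make the two mechanisms above concrete, here is a minimal sketch using only the Python standard library. The robots.txt content, the example.com URLs, and the helper names (`build_parser`, `RobotsMetaFinder`, `robots_meta`) are all made up for illustration; the user-agent tokens are just example strings. It shows roughly what a well-behaved crawler would check, not how Google or Common Crawl actually implement it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: one crawler is blocked
# entirely, everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_meta(html: str) -> set[str]:
    """Return the set of robots meta directives found in the HTML."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.directives


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("CCBot", "https://example.com/post.html"))
print(parser.can_fetch("Googlebot", "https://example.com/post.html"))
print(parser.can_fetch("Googlebot", "https://example.com/private/notes.html"))

page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print("noindex" in robots_meta(page))
```

Note that both checks are voluntary on the crawler's side: they express the site owner's wish, and nothing in the protocol enforces it.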
There is no practical way to make your documents accessible only for a limited period of your choosing. Once you release them to the world, you lose control over their distribution and use.