| HN Mirror

by effie 3821 days ago

> "These all seem like parasites to me."

Your impression is wrong. Search engines and other services built on web data provide great value to society. They don't create the documents they link to, but they deliver relevant links in response to people's queries. That's a great service. Without search engines, people might never find the web page at all. That's why a large portion of website owners and webmasters are glad that search engine crawlers visit them, and even expect indexing to be fast and smooth.

If you publish anything on your website, you're facilitating free use and duplication of it all over the world. If that was not your intention but you published your stuff anyway, you misunderstood the original intent and the reality of the Web, which is sharing information.

There is a widely known standard for communication between robots and web sites, the robots.txt standard. It is a file where you can state your intent to restrict crawler downloads. There is also the HTML tag <meta name="robots" content="noindex,nofollow">, which signals to crawlers your wish that the page should not appear in search engine results. If you want to keep crawlers from fetching and indexing your documents, use these. Both Google and Common Crawl seem to obey them. If you want to _make_sure_ nobody accesses and uses your documents, don't publish them on the Web.
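
To make the two mechanisms above concrete, here is a minimal sketch of how a well-behaved crawler might check them, using only the Python standard library. The robots.txt content, the example.com URLs, the "ExampleBot" user agent and the helper names are all made up for illustration; this is not how Google or Common Crawl actually implement it.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption, not a real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_meta_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # "ExampleBot" and the example.com URLs are placeholders.
    print(allowed_by_robots(ROBOTS_TXT, "ExampleBot", "https://example.com/private/a.html"))
    print(allowed_by_robots(ROBOTS_TXT, "ExampleBot", "https://example.com/public/a.html"))
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(robots_meta_directives(page))
```

Note that both checks run on the crawler's side. Nothing here stops a client that simply ignores them, which is the point of the last sentence above.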

There is no practical way to make your documents accessible only for a limited period of your choosing. Once you release them to the world, you lose control over their distribution and use.