
	by nfhshy68 1403 days ago
	I'm surprised anyone actually gives a crap about robots.txt. Don't put it online if you don't want it crawled. Welcome to the internet.

4 comments

eesmith 1402 days ago

HN does. https://news.ycombinator.com/robots.txt

So does Wikipedia. https://en.wikipedia.org/robots.txt . That points out:

  # enwiki:
  # Folks get annoyed when VfD discussions end up the number 1 google hit for
  # their name. See T6776
  Disallow: /wiki/Wikipedia:Articles_for_deletion/

As does GitHub - https://github.com/robots.txt

Even the Internet Archive, which doesn't honor directives in the robots.txt files, has one - http://archive.org/robots.txt .

Welcome to the internet.
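
For anyone curious how a compliant crawler would read the Wikipedia rule quoted above, here is a minimal sketch using Python's standard `urllib.robotparser`. The `User-agent: *` line, the `ExampleBot` name, and the sample page paths are assumptions for illustration; the excerpt in the comment shows only the `Disallow` line.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the rule sits under a wildcard User-agent group.
# Only the Disallow line appears in the quoted excerpt.
rules = """\
User-agent: *
Disallow: /wiki/Wikipedia:Articles_for_deletion/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the (hypothetical) agent may fetch the given path."""
    return parser.can_fetch(agent, "https://en.wikipedia.org" + path)


# A deletion discussion page falls under the Disallow prefix.
blocked_page = allowed("/wiki/Wikipedia:Articles_for_deletion/Some_page")
# An ordinary article path does not.
normal_page = allowed("/wiki/Robots.txt")
print(blocked_page, normal_page)
```

Nothing here enforces anything: the parser only tells a crawler what the site asked for, and the crawler decides whether to listen.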


nfhshy68 1402 days ago

Having one doesn't imply you care about it.

Also, cargo culting has never been a good reason to do anything.


eesmith 1402 days ago

What makes you think Wikipedia doesn't care about robots.txt?

Also, sloths can hold their breath underwater for up to 40 minutes.


nfhshy68 1401 days ago

Nothing, I don't make assumptions. That's something you do.


eesmith 1401 days ago

You assumed people don't give a crap about robots.txt, otherwise you wouldn't have been surprised.


crazygringo 1402 days ago

You don't need to be snarky.

Robots.txt isn't for hiding/suppressing information.

Often you can have whole URL structures that are redundant with other ones, mainly database-generated pages with all sorts of possible query parameters, often disguised as paths. Robots.txt is extremely useful for letting crawlers make life easier for themselves by sticking to the "real" content rather than the redundant stuff: crawling the 5,000 real pages, not the 500,000 additional URLs that return the same content.

It's also useful for ignoring "interactive" pages like login pages that make zero sense to crawl.

People "give a crap" about robots.txt because it's useful for that.
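
A rough sketch of that idea, assuming a made-up shop site: the host `shop.example`, the `/products/filter/` and `/login` paths, and the `ExampleBot` agent are all hypothetical. It uses plain prefix rules, since Python's `urllib.robotparser` does simple prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: hide the filter/sort permutations and the
# login page, leave the canonical product pages crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /products/filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

candidate_urls = [
    "https://shop.example/products/42",
    # Two orderings of the same filters, returning the same content.
    "https://shop.example/products/filter/color/red/sort/price",
    "https://shop.example/products/filter/sort/price/color/red",
    "https://shop.example/login",
]

# A well-behaved crawler keeps only what the site says is worth fetching.
crawlable = [u for u in candidate_urls if parser.can_fetch("ExampleBot", u)]
print(crawlable)
```

Only the canonical product page survives the filter, which is the 5,000-versus-500,000 point in miniature.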


tomxor 1403 days ago

In light of the recent surge in scraped content trumping original content on Google search, this is so true... people who scrape your site do not care about your preferences in a txt file.


stark98 1402 days ago

Yeah, I feel like it only discourages legitimate bots, but for malicious ones it's a big red sign saying "SENSITIVE CONTENT HERE, SCRAPE IT"


yakubin 1402 days ago

While robots.txt is not a good security measure, it's a good tool for keeping pages like wp-login.php from polluting search results. It's more about steering users to the subset of pages that are useful to them, away from pages that are only an implementation detail.

Also, some poorly rate-limited crawlers actually abide by robots.txt, so it's useful to prevent unnecessary load.
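
One way this can play out, as a sketch only: some crawlers honor a `Crawl-delay` line, which is a non-standard extension and not something every bot respects. The robots.txt content, the `plan_fetches` helper, and the `ExampleBot` agent below are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a WordPress-style site.
# Crawl-delay is a non-standard directive; support varies by crawler.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /wp-login.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def plan_fetches(paths, agent="ExampleBot", default_delay=1):
    """Return (seconds_from_start, path) pairs for the allowed paths only."""
    delay = parser.crawl_delay(agent) or default_delay
    allowed = [p for p in paths if parser.can_fetch(agent, p)]
    return [(i * delay, p) for i, p in enumerate(allowed)]


schedule = plan_fetches(["/", "/wp-login.php", "/about"])
print(schedule)
```

The login page drops out of the plan entirely, and the remaining requests are spaced by the delay the site asked for.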
