by nfhshy68 1403 days ago
I'm surprised anyone actually gives a crap about robots.txt.

Don't put it online if you don't want it crawled. Welcome to the internet.

4 comments

HN does. https://news.ycombinator.com/robots.txt

So does Wikipedia. https://en.wikipedia.org/robots.txt . That points out:

  # enwiki:
  # Folks get annoyed when VfD discussions end up the number 1 google hit for
  # their name. See T6776
  Disallow: /wiki/Wikipedia:Articles_for_deletion/

As does GitHub - https://github.com/robots.txt

Even the Internet Archive, which doesn't honor directives in the robots.txt files, has one - http://archive.org/robots.txt .

Welcome to the internet.

Having one doesn't imply you care about it.

Also, cargo culting has never been a good reason to do anything.

What makes you think Wikipedia doesn't care about robots.txt?

Also, sloths can hold their breath underwater for up to 40 minutes.

Nothing, I don't make assumptions. That's something you do.
You assumed people don't give a crap about robots.txt, otherwise you wouldn't have been surprised.
You don't need to be snarky.

Robots.txt isn't for hiding/suppressing information.

Oftentimes you have whole URL structures that are redundant with other ones, mainly database-generated pages with all sorts of possible query parameters, often disguised as paths. Robots.txt is extremely useful for letting crawlers make life easier for themselves by sticking to the "real" content as opposed to the redundant stuff: crawling the 5,000 real pages, not the 500,000 additional URLs that return the same content.

Also for ignoring "interactive" pages like login pages that make zero sense to be crawled.

People "give a crap" about robots.txt because it's useful for that.
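
For anyone curious what that looks like from the crawler's side, here is a minimal sketch using Python's standard urllib.robotparser. The rules, the example.com URLs, and the "ExampleBot" user agent are all made up for illustration; a real site would list its own redundant and interactive paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of the query-driven search pages
# and the login page, and leave everything else open.
RULES = """\
User-agent: *
Disallow: /search
Disallow: /login
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/articles/42",
        "https://example.com/search?q=foo&sort=asc",
        "https://example.com/login",
    ):
        print(url, "->", "crawl" if allowed(url) else "skip")
```

The standard-library parser only does prefix matching on paths, so this is roughly what a simple well-behaved crawler does, not a model of how any particular search engine interprets the file.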

In light of the recent surge in scraped content trumping original content on Google search, this is so true... people who scrape your site do not care about your preferences in a txt file.
Yeah, I feel like it only discourages legitimate bots, but for malicious ones it's a big red sign saying "SENSITIVE CONTENT HERE, SCRAPE IT".
While robots.txt is not a good security measure, it's a good tool for preventing pollution of search results with pages like wp-login.php. It's more about steering users to the subset of pages that are useful to them, and away from the pages which are only an implementation detail.

Also, some poorly rate-limited crawlers actually abide by robots.txt, so it's useful to prevent unnecessary load.
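
A rough sketch of the load side, again with Python's standard urllib.robotparser. Note that Crawl-delay is a non-standard directive that only some crawlers honor, and the rules, the "ExampleBot" name, and the one-second fallback here are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask crawlers to wait between requests and to stay
# away from an expensive export endpoint.
RULES = """\
User-agent: *
Crawl-delay: 5
Disallow: /export/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def polite_delay(agent: str = "ExampleBot", default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if it
    declares one for this agent, otherwise a fallback chosen by the crawler."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    print("wait", polite_delay(), "seconds between requests")
    print("export allowed:", parser.can_fetch("ExampleBot", "https://example.com/export/all.csv"))
```

A crawler that abides by the file would sleep for that long between fetches and skip the disallowed paths entirely; one that ignores robots.txt is unaffected either way.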