Hacker News
by namegulf 28 days ago
Robots.txt is lame BTW, there is no way to enforce it. It is up to the bot to decide whether to crawl or not, and in most cases they don't care.

Cloudflare had a nice technique to address the bot problem (if you use their name servers). It'll respect and use the robots.txt while sending the remaining bots to a deep black hole.

3 comments

Yes, we know, its purpose is to guide the bots, not forcibly block them.

That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.

Why downvote a comment?

You're talking about one (yes, one of the biggest), but millions of other bots not following it must be a bigger story.

Robots.txt is great if you're trying to run an above-board operation. Much easier than trying to guess how a webmaster wishes the crawler to behave, and then getting angry emails when you guess wrong.
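For what it's worth, here is a minimal sketch of what "above board" can look like in practice, using Python's standard `urllib.robotparser`. The crawler name `ExampleCrawler/1.0`, the `example.com` URLs, and the helper `plan_fetches` are made up for illustration, and `Crawl-delay` is a common but non-standard directive that not every site or crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from https://<host>/robots.txt before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def plan_fetches(robots_txt, user_agent, urls):
    """Return (urls the site allows for this agent, requested delay in seconds).

    Nothing here is enforced by the server: the crawler chooses to ask
    and chooses to obey the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return allowed, delay


allowed, delay = plan_fetches(
    ROBOTS_TXT,
    "ExampleCrawler/1.0",
    ["https://example.com/blog/post", "https://example.com/private/data"],
)
print(allowed, delay)
```

The point of the sketch is that the whole check happens on the crawler's side: a bot that skips `plan_fetches` gets exactly the same pages as one that runs it.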
It's not great. It used to be very common for robots.txt to Disallow *, Allow GoogleBot, which just entrenches the search engine monopoly. In response, other search engines just used the rules for GoogleBot instead of the rules for their own crawlers.
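A rough illustration of that pattern, again with Python's standard `urllib.robotparser`. The robots.txt below is a guess at what such a file typically looked like, and `ExampleSearchBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed shape of a "Googlebot only" robots.txt: one group that lets
# Googlebot in everywhere, and a catch-all group that shuts everyone else out.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(GOOGLEBOT_ONLY.splitlines())

url = "https://example.com/articles/1"

# Evaluated under its own (hypothetical) name, the other crawler is blocked.
own_rules = parser.can_fetch("ExampleSearchBot", url)

# Evaluated under the Googlebot token, the same URL is allowed. This is the
# workaround described above: apply Googlebot's rules instead of your own.
googlebot_rules = parser.can_fetch("Googlebot", url)

print(own_rules, googlebot_rules)
```

Since the file only states rules per user-agent token, nothing stops a crawler from picking whichever group is most favorable to it.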
Eh, not really my experience running an internet search engine and a crawler. It happens occasionally, but mostly people seem to focus on what they perceive as nuisance crawlers if they do disallow any specific UAs.
Yeah, robots.txt is a great herald example of the type of solution invented by people who don't understand incentives whatsoever.
robots.txt is a great herald example of people misunderstanding and misusing a tool. The file was designed to help crawlers by pointing them to the content most valuable to index and helping them avoid wasting resources on useless pages.

The people trying to use it to block or limit bots are uninformed and/or misinformed.