HN Mirror


	by namegulf 79 days ago
	Robots.txt is lame BTW; there is no way to enforce it. It is up to the bot to decide whether to crawl or not, and in most cases they don't care. Cloudflare had a nice technique to address the bot problem (if you use their name servers). It'll respect and use the robots.txt while sending the remaining bots to a deep black hole.

3 comments

input_sh 79 days ago

Yes, we know, its purpose is to guide the bots, not forcibly block them.

That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.

namegulf 79 days ago

Why downvote a comment?

You're talking about one (yes, the biggest), but millions of other bots not following it must be a bigger story.

marginalia_nu 79 days ago

Robots.txt is great if you're trying to run an above-board operation. It's much easier than trying to guess how a webmaster wishes the crawler to behave, and then getting angry emails when you guess wrong.
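
For illustration, here is a minimal sketch of what honoring robots.txt can look like on the crawler side, using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the example.com URLs are all made up for the example; a real crawler would download the site's actual file rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical crawler name, not a real bot.
USER_AGENT = "ExampleBot"

polite_parser = RobotFileParser()
polite_parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str) -> bool:
    """Return True if the parsed rules let USER_AGENT fetch this URL."""
    return polite_parser.can_fetch(USER_AGENT, url)


# The webmaster's wishes are read from the file instead of guessed.
public_ok = may_crawl("https://example.com/public/page")
private_ok = may_crawl("https://example.com/private/data")
delay_seconds = polite_parser.crawl_delay(USER_AGENT)
```

Nothing here stops a crawler from skipping the check entirely; the file only helps operators who choose to consult it.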

tardedmeme 79 days ago

It's not great. It used to be very common that robots.txt would Disallow * and Allow GoogleBot, which just entrenches the search engine monopoly. In response to this, other search engines just used the rules for GoogleBot instead of the rules for their own crawlers.
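
As a rough sketch of that pattern, the snippet below parses an illustrative robots.txt that admits only Googlebot and compares the answer for a crawler's own name with the answer it gets by looking up Googlebot's rules. The file contents, the `OtherSearchBot` name, and the URL are assumptions for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: everything is disallowed except for Googlebot.
GOOGLE_ONLY_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

google_only_parser = RobotFileParser()
google_only_parser.parse(GOOGLE_ONLY_ROBOTS_TXT.splitlines())

page_url = "https://example.com/article"

# A hypothetical competing crawler falls under the catch-all "*" group.
allowed_as_self = google_only_parser.can_fetch("OtherSearchBot", page_url)

# The workaround described above: evaluate the rules written for Googlebot.
allowed_as_googlebot = google_only_parser.can_fetch("Googlebot", page_url)
```

Under these rules the first lookup is refused and the second is permitted, which is the asymmetry the comment describes.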

marginalia_nu 78 days ago

Eh, that's not really my experience running an internet search engine and a crawler. It happens occasionally, but when people do disallow specific UAs, they mostly seem to focus on what they perceive as nuisance crawlers.

llbbdd 79 days ago

Yeah, robots.txt is a great herald example of the type of solution invented by people who don't understand incentives whatsoever.

Ferret7446 79 days ago

robots.txt is a great herald example of people misunderstanding and misusing a tool. The file was designed to help crawlers by pointing them to the content most valuable to index and helping them avoid wasting resources on useless pages.

The people trying to use it to block or limit bots are uninformed and/or misinformed.
