HN Mirror

by kazinator 3950 days ago

robots.txt is widely ignored. User-Agent headers are faked so that robots look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.
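To make that concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `allowed` helper, and the example URL are all made up for illustration; the point is only that the check keys on whatever name the client claims.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, everyone else nothing.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(claimed_name, url, robots_txt=EXAMPLE_ROBOTS_TXT):
    """Return True if robots.txt permits a client claiming `claimed_name` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # The parser only ever sees the self-declared name; nothing here verifies identity.
    return parser.can_fetch(claimed_name, url)


if __name__ == "__main__":
    print(allowed("Googlebot", "http://example.com/page"))
    print(allowed("SomeOtherCrawler", "http://example.com/page"))
```

Any crawler that passes "Googlebot" as its name gets the Googlebot rules, which is why the file alone can't keep anyone out.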

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.
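One simple form of this, as a rough sketch: count requests per client IP in a sliding time window and flag anything that goes over a threshold. Request rate is just one possible signal, and the `RateWatcher` class, its thresholds, and what to do with a flagged client are all assumptions for illustration.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flags clients that make more than `max_requests` within `window_seconds`."""

    def __init__(self, max_requests=20, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def is_suspicious(self, client_ip, now):
        """Record a request at time `now` (seconds) and report whether the client is over the limit."""
        hits = self._hits[client_ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    import time

    watcher = RateWatcher(max_requests=3, window_seconds=10.0)
    for _ in range(5):
        print(watcher.is_suspicious("203.0.113.7", time.time()))
```

A real setup would likely combine this with other signals (paths requested, missing asset fetches, and so on) before deciding how to respond.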

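A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a hostname in the ".googlebot.com" domain. A forward lookup on that hostname should then return the same IP, since the reverse record alone is set by whoever controls the address block.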
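A sketch of that check in Python, using only the standard `socket` module. The function name `is_genuine_googlebot` and the injectable lookup parameters are hypothetical (the latter exist so the logic can be exercised without network access), the suffix list contains only the domain named above, and `gethostbyname_ex` resolves IPv4 only, so IPv6 clients would need a different forward lookup.

```python
import socket

# Only the suffix mentioned above; extend it if other domains should count.
GOOGLEBOT_SUFFIXES = (".googlebot.com",)


def is_genuine_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve `ip`, check the domain, then confirm with a forward lookup."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record or resolver failure
        return False

    # The leading dot in the suffix keeps "fakegooglebot.com" from matching.
    if not hostname.lower().rstrip(".").endswith(GOOGLEBOT_SUFFIXES):
        return False

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # The PTR record is controlled by the IP's owner, so require the name to map back.
    return ip in addresses


if __name__ == "__main__":
    # Documentation-range address, used here only as a placeholder.
    print(is_genuine_googlebot("192.0.2.1"))
```

DNS lookups are slow, so in practice the result would probably be cached per IP rather than computed on every request.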

1 comment

3pt14159 3949 days ago

I'm just saying: don't guilt me if I'm following robots.txt.
