by 3pt14159 3954 days ago
Hey man, if I'm following your robots.txt, it's all good, right? You have the power to allow only the people who give you benefit:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-agent: bingbot
    Allow: /

robots.txt is widely ignored. Robots fake their User-Agent headers so that they look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.
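
To make that concrete, here is a rough sketch using Python's standard `urllib.robotparser` against a trimmed copy of the rules from the parent comment. The helper name `is_allowed`, the `SomeScraper` agent name, and the `example.com` URL are made up for illustration; the point is only that the check keys on whatever name the client claims.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the robots.txt from the parent comment.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def is_allowed(claimed_agent: str, url: str = "https://example.com/page") -> bool:
    """Return True if robots.txt permits `claimed_agent` to fetch `url`.

    `claimed_agent` is whatever the client says it is; nothing here
    (or in robots.txt itself) verifies that claim.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(claimed_agent, url)


if __name__ == "__main__":
    # A scraper using its own (hypothetical) name falls under "User-agent: *".
    print("SomeScraper:", is_allowed("SomeScraper"))
    # The same scraper calling itself Googlebot matches the Googlebot group.
    print("Googlebot:  ", is_allowed("Googlebot"))
```

The parser has no way to tell a real Googlebot from an impostor; it only compares strings.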

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.

A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a hostname in the ".googlebot.com" domain.
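
That lookup could be sketched roughly as follows, using only Python's `socket` module. The function names are hypothetical, and the forward lookup at the end is an extra step I'm adding on the assumption that a reverse record alone can be set to anything by whoever controls the IP block. The suffix tuple holds only the domain mentioned above; treat it as something to extend.

```python
import socket

# Only the domain named in the comment above; extend as needed (assumption).
TRUSTED_SUFFIXES = (".googlebot.com",)


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for `ip` (raises OSError on failure)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set[str]:
    """Return the set of IP addresses `hostname` resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_genuine_googlebot(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Check that `ip` reverse-resolves into a trusted domain and back again."""
    try:
        hostname = reverse(ip).lower().rstrip(".")
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        # Forward-confirm: the hostname must map back to the same IP.
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Performs real DNS queries; the address is just an example input.
    print(is_genuine_googlebot("66.249.66.1"))
```

The `reverse` and `forward` parameters exist only so the check can be exercised without touching the network; DNS lookups are slow, so in practice the result would probably be cached per IP.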

I'm just saying: don't guilt me if I'm following robots.txt.