by kazinator 3950 days ago
> I run web crawlers for fun and profit

Not my fun and profit, though.

2 comments

You don't know that. He could be running a crawler that builds a service that ends up sending you very valuable traffic... Or you could be right... Not really enough info though.
There is plenty of info, namely that the service doesn't exist now, and that we can recognize and allow its crawler if it ever becomes successful. It just doesn't make sense to allow every Tom, Dick and Harry's crawler on the promise of some future benefit. That's like delivering and opening every spam e-mail in case there is a nugget in there.

The idea of saying yes to everything is comically explored by Jim Carrey in the film:

https://en.wikipedia.org/wiki/Yes_Man_%28film%29

Hey man, if I'm following your robots.txt, it's all good, right? You have the power to allow only the crawlers that benefit you:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-agent: bingbot
    Allow: /

robots.txt is widely ignored. User-agent fields are faked out to make robots look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.
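
To make that concrete, here is a rough sketch using Python's standard `urllib.robotparser` against the rules quoted above. The `is_allowed` helper and the example.com URL are made up for illustration; the point is only that the decision depends on whatever name the crawler chooses to report.

```python
from urllib.robotparser import RobotFileParser

# The policy quoted earlier in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: bingbot
Allow: /
"""


def is_allowed(claimed_agent: str, url: str = "https://example.com/page") -> bool:
    """Hypothetical helper: does robots.txt permit this self-reported name?"""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(claimed_agent, url)


if __name__ == "__main__":
    # An unknown crawler is refused under its own name and accepted
    # the moment it calls itself Googlebot.
    print(is_allowed("SomeRandomCrawler"))
    print(is_allowed("Googlebot"))
```

Nothing in the file can check the claim; it is an honor system.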

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.
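
One very simple version of that kind of code is a per-IP sliding-window request counter. This is only a sketch: the `RateWatcher` class, the thresholds, and the idea that request rate alone is the signal are all assumptions for illustration, and a real setup would look at more than rate.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Hypothetical detector: flags a client that exceeds max_requests
    within the last window_seconds."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def record(self, client_ip: str, now: float) -> bool:
        """Record one request; return True if the client looks suspicious."""
        hits = self._hits[client_ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    watcher = RateWatcher(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, watcher.record("203.0.113.7", now=t))
```

What "responds" means (a block, a tarpit, a CAPTCHA) is left to whoever calls `record`.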

A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a host name in the ".googlebot.com" domain.
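
A minimal sketch of that check with Python's `socket` module follows. The function names are made up, and the forward lookup at the end is an extra step I'm assuming you'd want: whoever controls the reverse DNS for an IP can make it say anything, so the name is resolved back and compared with the original address.

```python
import socket

GOOGLEBOT_SUFFIX = ".googlebot.com"  # the domain mentioned above


def hostname_is_googlebot(hostname: str) -> bool:
    """True if the host name sits inside the googlebot.com domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLEBOT_SUFFIX)


def is_genuine_googlebot(client_ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm.

    Sketch only: gethostbyname_ex handles IPv4 only, and nothing is cached.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(client_ip)
    except OSError:  # no PTR record, or the lookup failed
        return False
    if not hostname_is_googlebot(hostname):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The name must resolve back to the address we started from.
    return client_ip in forward_ips
```

You would call `is_genuine_googlebot(ip)` only for requests whose User-agent claims to be Googlebot, so ordinary visitors don't pay for the DNS lookups.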

I'm just saying: don't guilt me if I'm following robots.txt.