| HN Mirror



	by kazinator 3950 days ago
	> I run web crawlers for fun and profit

	Not my fun and profit, though.

2 comments

bkmartin 3950 days ago

You don't know that. He could be running a crawler that builds a service that ends up sending you very valuable traffic... Or you could be right... Not really enough info though.


kazinator 3950 days ago

There is plenty of info, namely that the service doesn't exist now, and that we can recognize and allow its crawler if it ever becomes successful. It just doesn't make sense to allow every Tom, Dick and Harry's crawler on the promise of some future benefit. That's like delivering and opening every spam e-mail in case there is a nugget in there.

The idea of saying yes to everything is comically explored by Jim Carrey in the film:

https://en.wikipedia.org/wiki/Yes_Man_%28film%29


3pt14159 3950 days ago

Hey man, if I'm following your robots.txt, it's all good, right? You have the power to allow only the crawlers that benefit you:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-agent: bingbot
    Allow: /


kazinator 3950 days ago

robots.txt is widely ignored. User-Agent headers are faked to make robots look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.
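To make that concrete, here is a minimal sketch using Python's standard `urllib.robotparser` on the rules quoted above. The crawler name `ExampleCrawler` is made up for illustration. The point is only that the decision depends on whatever name the client reports, not on who the client actually is.

```python
from urllib.robotparser import RobotFileParser

# The allow-list from the thread, one group per crawler name.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: bingbot
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A crawler reporting an unlisted (hypothetical) name falls under "*".
print(parser.can_fetch("ExampleCrawler", "/page"))  # False

# The same crawler reporting itself as Googlebot is let through.
print(parser.can_fetch("Googlebot", "/page"))  # True
```

Nothing in the file format lets the site check the claim, which is why the check has to happen somewhere else.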

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.
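One simple form of "suspicious access pattern" is a client that makes too many requests in a short time. The sketch below is only an illustration of that idea, not what any particular site runs: the class name `RateWatcher`, the thresholds, and the example IP are all assumptions, and what to do with a flagged client (block, throttle, serve a captcha) is left open.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flag clients that exceed max_hits requests within window_s seconds."""

    def __init__(self, max_hits: int = 20, window_s: float = 10.0):
        self.max_hits = max_hits
        self.window_s = window_s
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if the IP looks suspicious."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        return len(hits) > self.max_hits


# Usage with explicit timestamps (a server would pass time.monotonic()).
watcher = RateWatcher(max_hits=3, window_s=10.0)
flags = [watcher.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(flags)  # [False, False, False, True]
```

A real detector would likely look at more than rate, for example request ordering or whether a "browser" ever fetches images and stylesheets, but the sliding-window counter shows the basic shape.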

A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a hostname in the ".googlebot.com" domain.
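A rough sketch of that check in Python follows. Two things here go beyond the comment and are assumptions: the function name `is_genuine_googlebot`, and the extra forward lookup. The forward step is added because whoever controls an IP range also controls its reverse (PTR) record and can make it say anything, so the hostname is only trusted if it resolves back to the same IP. The suffix list contains only the domain named above and may need extending.

```python
import socket

# Only the domain mentioned in the thread; extend as needed.
TRUSTED_SUFFIXES = (".googlebot.com",)


def is_genuine_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup.

    The two lookup parameters exist so the logic can be exercised without
    network access; by default they use the system resolver (IPv4 only).
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(TRUSTED_SUFFIXES):
            return False
        # The PTR record alone proves nothing; the name must map back to the IP.
        return ip in forward_lookup(host)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False


# Usage with stubbed lookups (192.0.2.0/24 is a documentation-only range).
fake_host = "crawl-192-0-2-10.googlebot.com"
print(is_genuine_googlebot("192.0.2.10",
                           reverse_lookup=lambda addr: fake_host,
                           forward_lookup=lambda host: ["192.0.2.10"]))  # True
```

Since DNS lookups are slow, a server doing this per request would presumably cache the verdict per IP.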


3pt14159 3949 days ago

I'm just saying: don't guilt me if I'm following robots.txt.
