HN Mirror

	by kazinator 3950 days ago
	Most 404 accesses are from people probing a server for common URLs belonging to vulnerable web applications. Thus my 404 page is a honeypot, and too many 404 accesses in a short period result in an automatic IP ban.
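A minimal sketch of how such a ban could be tracked, assuming a sliding time window per IP. The class name `NotFoundBanList`, the thresholds, and the injectable `clock` are hypothetical choices for illustration, not the poster's actual setup:

```python
import time
from collections import defaultdict, deque


class NotFoundBanList:
    """Ban an IP once it triggers too many 404s within a sliding window.

    Hypothetical helper: the limits below are illustrative defaults.
    """

    def __init__(self, max_hits=10, window_seconds=60.0, clock=time.monotonic):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so it can be tested
        self.hits = defaultdict(deque)    # ip -> timestamps of recent 404s
        self.banned = set()

    def record_404(self, ip):
        """Record one 404 for `ip`; return True if the IP is banned."""
        if ip in self.banned:
            return True
        now = self.clock()
        recent = self.hits[ip]
        recent.append(now)
        # Forget 404s that have fallen out of the window.
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned.add(ip)
            del self.hits[ip]
        return ip in self.banned

    def is_banned(self, ip):
        return ip in self.banned
```

The 404 handler would call `record_404` with the client address, and a front-end check would reject requests for which `is_banned` is true. A visitor who hits an occasional dead link stays under the limit; a scanner walking a list of known-vulnerable paths does not.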

4 comments

3pt14159 3950 days ago

I would reconsider this approach. I run web crawlers for fun and profit, and there are several pages that you can assume a website might have, like robots.txt, humans.txt, sitemap.xml, sitemaps/sitemap-index.xml, blog.example.com, blog, etc.

If someone is trying to get to admin.php, sure, ban them. Or if they are not following robots.txt. But sitemaps are sometimes not reliable enough, and not all crawlers are meanies.

kazinator 3950 days ago

> I run web crawlers for fun and profit

Not my fun and profit, though.

bkmartin 3950 days ago

You don't know that. He could be running a crawler that builds a service that ends up sending you very valuable traffic... Or you could be right... Not really enough info though.

kazinator 3950 days ago

There is plenty of info, namely that the service doesn't exist now, and that we can recognize and allow its crawler if it ever becomes successful. It just doesn't make sense to allow every Tom, Dick and Harry's crawler on the promise of some future benefit. That's like delivering and opening every spam e-mail in case there is a nugget in there.

The idea of saying yes to everything is comically explored by Jim Carrey in the film:

https://en.wikipedia.org/wiki/Yes_Man_%28film%29

3pt14159 3950 days ago

Hey man, if I'm following your robots.txt, it's all good, right? You have the power to allow only the people who give you benefit:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-agent: bingbot
    Allow: /
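As a rough check of how a compliant crawler would read these rules, here is a sketch using Python's standard `urllib.robotparser`. The helper `allowed` and the crawler name `ExampleCrawler` are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# The rules from the comment above, one group per crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: bingbot
Allow: /
"""


def allowed(user_agent, url="/"):
    """Return True if `user_agent` may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Slurp", "bingbot", "ExampleCrawler"):
        print(agent, allowed(agent))
```

Only the three named crawlers are let in; anything else falls through to the `*` group and is disallowed. The match is on the name the client announces, so this governs only crawlers that choose to obey it.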

kazinator 3950 days ago

robots.txt is widely ignored. User-agent fields are faked out to make robots look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.

A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a hostname in the ".googlebot.com" domain.
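A sketch of that check, using only Python's standard `socket` module. The function name `is_genuine_googlebot` and the injectable lookup parameters are hypothetical. The forward lookup at the end is an added assumption beyond what is described above: whoever controls an IP block also controls its reverse DNS, so the claimed hostname is resolved again and must map back to the same IP.

```python
import socket

TRUSTED_SUFFIX = ".googlebot.com"


def is_genuine_googlebot(ip,
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Check a claimed Googlebot by reverse DNS, then confirm forward.

    The lookup functions are parameters only so the logic can be
    exercised without network access.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIX):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses).
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Note that `socket.gethostbyname_ex` handles IPv4 only; an IPv6 client would need `socket.getaddrinfo` instead. DNS lookups are slow, so in practice the result would likely be cached per IP.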

3pt14159 3949 days ago

I'm just saying: don't guilt me if I'm following robots.txt.

kkl 3950 days ago

A somewhat-related honeypot that I have seen:

Include a directory "visiting-this-will-ban-you" in your robots.txt, and IP-ban whoever visits it.
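One way this trap could be wired up, as a sketch. The class `TrapBanList`, the status codes returned, and the exact robots.txt contents are assumptions for illustration:

```python
TRAP_PREFIX = "/visiting-this-will-ban-you"

# Well-behaved crawlers read this and stay out of the trap directory.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PREFIX}/\n"


class TrapBanList:
    """Ban any client IP that requests a path under the trap directory."""

    def __init__(self):
        self.banned = set()

    def handle(self, ip, path):
        """Return the HTTP status a front-end might send for this request."""
        if path.startswith(TRAP_PREFIX):
            self.banned.add(ip)
        if ip in self.banned:
            return 403  # banned now, or by an earlier trap visit
        return 200      # placeholder for normal request handling
```

Nothing on the site links to the trap directory, so the only clients that reach it are ones that read robots.txt and went looking precisely where they were told not to.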

dogma1138 3950 days ago

Really depends on the site. From what I've seen on large sites that went through 10 years or so of revisions and redesigns, the vast majority of 404s come from broken links in the site itself.

Heck, when MSFT redesigned TechNet/MSDN, half of the links were dead; a lot of the Google results for legacy products will still lead you to a 404 page on TechNet...

taf2 3950 days ago

Just be sure it's not an IP that many legit users are passing through... like in the days when dial-up AOL users would all pass through the same IP.

kazinator 3950 days ago

All dial-up users being on the same IP would be tremendously convenient on the SMTP defense front, on the other hand.
