by kazinator 3950 days ago
Most 404 accesses are from people probing a server for common URLs belonging to vulnerable web applications.

Thus my 404 page is a honeypot, and too many 404 accesses in a short period results in an automatic IP ban.
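A minimal sketch of how such a 404-triggered ban could work, assuming a sliding time window per IP. The class name `NotFoundBanList` and the thresholds are hypothetical; the post does not say what limits or mechanism are actually used.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds; the post does not give actual numbers.
MAX_404S = 10        # 404 responses tolerated per window
WINDOW_SECONDS = 60  # length of the sliding window in seconds


class NotFoundBanList:
    """Track 404 responses per IP and ban IPs that exceed the limit."""

    def __init__(self, max_404s=MAX_404S, window=WINDOW_SECONDS,
                 clock=time.monotonic):
        self.max_404s = max_404s
        self.window = window
        self.clock = clock               # injectable so it can be tested
        self.hits = defaultdict(deque)   # ip -> timestamps of recent 404s
        self.banned = set()

    def record_404(self, ip):
        """Record one 404 for ip; return True if the ip is now banned."""
        if ip in self.banned:
            return True
        now = self.clock()
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) > self.max_404s:
            self.banned.add(ip)
            del self.hits[ip]
        return ip in self.banned

    def is_banned(self, ip):
        return ip in self.banned
```

The 404 handler would call `record_404(client_ip)`, and every request would check `is_banned(client_ip)` first. A visitor who hits the occasional dead link stays under the limit; a scanner walking a list of known-vulnerable URLs does not.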

4 comments

I would reconsider this approach. I run web crawlers for fun and profit and there are several pages that you can assume a website might have, like robots.txt, humans.txt, sitemap.xml, sitemaps/sitemap-index.xml, blog.example.com, blog, etc.

If someone is trying to get to admin.php, sure, ban them. Or if they are not following robots.txt. But sitemaps are not reliable enough sometimes and not all crawlers are meanies.

> I run web crawlers for fun and profit

Not my fun and profit, though.

You don't know that. He could be running a crawler that builds a service that ends up sending you very valuable traffic... Or you could be right... Not really enough info though.
There is plenty of info, namely that the service doesn't exist now, and that we can recognize and allow its crawler if it ever becomes successful. It just doesn't make sense to allow every Tom, Dick and Harry's crawler on the promise of some future benefit. That's like delivering and opening every spam e-mail in case there is a nugget in there.

The idea of saying yes to everything is comically explored by Jim Carrey in the film:

https://en.wikipedia.org/wiki/Yes_Man_%28film%29

Hey man if I'm following your robots.txt it's all good right? You have the power to only allow the people that give you benefit:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-agent: bingbot
    Allow: /
robots.txt is widely ignored. User-agent fields are faked out to make robots look like Firefox on Windows.

Anyone can make a crawler and then have it report as Googlebot. That doesn't even violate the robots.txt; it says, if your name is Googlebot, you're allowed.
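To illustrate the point, here is a small sketch using Python's standard `urllib.robotparser` on a shortened version of the rules above. The helper `allowed` and the agent name `MyCrawler/1.0` are made up for the example. The parser, like the site, only ever sees the name the client chooses to report.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the robots.txt quoted in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed(user_agent, path="/"):
    """Return True if robots.txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# An honest crawler using its own name is turned away...
print(allowed("MyCrawler/1.0"))
# ...but the same crawler calling itself Googlebot is within the rules.
print(allowed("Googlebot/2.1"))
```

Nothing in the file checks who is actually behind the name, so the match is on the self-reported string alone.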

Blocking crap requires cunning: code that looks for suspicious access patterns and responds.

A genuine Googlebot should be operating from a Google domain. If we do a reverse DNS lookup on the client IP of a Googlebot request, we get a hostname in the ".googlebot.com" domain.
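A rough sketch of that check, using only the standard `socket` module. The function name `is_genuine_googlebot` and the injectable lookup parameters are assumptions made for the example. The forward-confirmation step is an addition not mentioned above: whoever controls an IP block also controls its reverse DNS, so the claimed hostname should resolve back to the same IP.

```python
import socket


def is_genuine_googlebot(ip, suffixes=(".googlebot.com",),
                         reverse_lookup=socket.gethostbyaddr,
                         forward_lookup=socket.gethostbyname_ex):
    """Check a claimed Googlebot IP with reverse DNS plus forward confirmation.

    suffixes holds only the domain named in the thread; extend it if needed.
    The lookup functions are parameters so they can be stubbed out in tests.
    """
    try:
        hostname = reverse_lookup(ip)[0]       # PTR lookup
    except OSError:
        return False                           # no reverse DNS at all
    if not hostname.lower().rstrip(".").endswith(suffixes):
        return False                           # not in an expected domain
    try:
        addresses = forward_lookup(hostname)[2]  # A lookup (IPv4 only)
    except OSError:
        return False
    # The hostname must map back to the IP that made the request.
    return ip in addresses
```

In practice the result would be cached per IP, since two DNS lookups per request are expensive.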

I'm just saying: don't guilt me if I'm following robots.txt.
A somewhat-related honeypot that I have seen:

List a directory such as "visiting-this-will-ban-you" under Disallow in your robots.txt, and IP-ban whoever visits it.
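A sketch of the idea, assuming an in-memory ban set and a hypothetical `check_request` hook that runs before normal request handling. Well-behaved crawlers read the Disallow line and stay out; anything that requests the trap path has either ignored robots.txt or mined it for targets.

```python
# Trap path taken from the comment above; everything else is illustrative.
TRAP_PREFIX = "/visiting-this-will-ban-you"

# What robots.txt would contain to warn off compliant crawlers.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PREFIX}\n"

banned_ips = set()


def check_request(ip, path):
    """Return an HTTP status code, banning any IP that enters the trap."""
    if ip in banned_ips:
        return 403
    if path.startswith(TRAP_PREFIX):
        banned_ips.add(ip)   # a single visit is enough
        return 403
    return 200               # hand off to normal request handling
```

The trap must never be linked from a real page, or ordinary visitors will walk into it.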

It really depends on the site. From what I've seen, on large sites that have gone through 10 years or so of revisions and redesigns, the vast majority of 404s come from broken links within the site itself.

Heck, when MSFT redesigned TechNet/MSDN, half of the links were dead. A lot of the Google results for legacy products will still lead you to a 404 page on TechNet...

Just be sure it's not an IP that many legit users are passing through ... like in the days when dial-up AOL users would all pass through the same IP.
All dial-up users being on the same IP would be tremendously convenient on the SMTP defense front, on the other hand.