by johngd 4220 days ago
My main focus for the entirety of my career has been on internet-facing consumer web applications. I have seen many, many DoS attacks, from IRC bots to Ukrainian web scrapers to Chinese get-lucky WordPress exploit scanners. Most of these can be ignored and blocked with little effort.

By FAR the most annoying of any of these is when Google, Bing and/or Yahoo decide to wake up and crawl your infrastructure with little regard for your robots.txt or webmaster settings, if available. I think they have gotten better in recent years, but they used to be the absolute worst. It came down to: let us DoS you, or have your ranking suffer. Suing Google, Bing, or Yahoo isn't exactly an option.

Some context: I was the lead architect/engineer combo for a CMS that hosted ~500k domains for a fairly large international company. Some days I could log in and see them crawling every domain from A to Z. Some days I would get hit by Google and Bing at the same time. They were the largest consumers of data on this system.

2 comments

FWIW, every time I've seen what looked like a major search engine ignoring rate-limits (either Crawl-Delay or webmaster tools settings) a check of the actual IPs being used showed that it was someone spoofing a well-known User-Agent, which left you needing some other form of rate-limiting either way.
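For anyone wanting to do the IP check described above, here is a minimal sketch of the usual reverse-then-forward DNS confirmation. The function name, the injectable lookup parameters, and the hostname suffix list are my own assumptions for illustration; check each search engine's current documentation for the domains its crawlers actually resolve to. The default forward lookup here only handles IPv4.

```python
import socket

# Hypothetical suffix list for illustration only; confirm against each
# search engine's own crawler-verification documentation.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def is_verified_crawler(ip, reverse_lookup=None, forward_lookup=None,
                        suffixes=CRAWLER_SUFFIXES):
    """Return True if `ip` reverse-resolves to a known crawler hostname
    and that hostname forward-resolves back to the same IP."""
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist); IPv4 only
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False

    # The PTR record is controlled by the IP's owner, so a spoofer can
    # set it to anything; the suffix check alone is not enough.
    if not host.lower().rstrip(".").endswith(suffixes):
        return False

    try:
        # Forward-confirm: the claimed hostname must map back to the IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A request whose User-Agent claims to be a search engine but fails this check can then be handed to whatever rate-limiting you already apply to unknown clients.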
Very true; however, in the incidents I'm referring to, this was definitely not the case.

In fact, for a while we would get Bing (MSN bot back then) crawling us every day at the same time, almost on the dot.

Let me plug Project Honeypot (which I am in no way affiliated with). This is a truly awesome, surprisingly accurate, free service that does an amazing job of collecting heuristics on suspicious IP activity and exposing them in an easy-to-interpret way.

http://www.projecthoneypot.org/index.php

Project Honeypot is indeed great – I've been running their collectors for years on a few spare domains, racking up a fair number of harvesters.
One thing that is often a surprise to webmasters is that if you have 100+ websites on a single IP, robots.txt specifies a crawl-delay per website and not per IP. So you can end up with 100x the crawling you thought you were specifying in robots.txt.
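To put numbers on that, here is a rough back-of-the-envelope sketch. It assumes the worst case where a crawler hits every site on the IP concurrently, each at its own crawl-delay; the function names are made up for illustration.

```python
def aggregate_crawl_rate(sites_on_ip, crawl_delay_seconds):
    """Worst-case requests per second arriving at one IP when each site
    is crawled independently at its own crawl-delay."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl_delay_seconds must be positive")
    return sites_on_ip / crawl_delay_seconds


def per_site_delay_for_budget(sites_on_ip, max_requests_per_second):
    """Crawl-delay each site would need so the combined rate on the
    shared IP stays within the given budget."""
    if max_requests_per_second <= 0:
        raise ValueError("max_requests_per_second must be positive")
    return sites_on_ip / max_requests_per_second


# 100 sites with a 10-second crawl-delay each: up to 10 requests/second
# on the shared IP, not the 0.1 requests/second a single site would see.
print(aggregate_crawl_rate(100, 10))

# To hold the whole IP to 1 request/second, each site needs a 100-second delay.
print(per_site_delay_for_budget(100, 1))
```

In practice crawlers rarely hit every site at once, so treat these as upper bounds rather than expected load.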
This is very true. Logistically (not technically), spreading the sites across more IPs wasn't an option for us - read: business decision. Before the days when everything had an API, and things were done with spreadsheets, and billing codes, and product codes, and typos... managing DNS for a large number of domains was not fun and rather expensive. Want to move a bunch of domains to a new IP address? $$$ + time + more time and money. We weren't able to trust our domain broker and had to double-check EVERYTHING they did.

To your point, if you are a large provider, especially one that passively and actively sends a lot of money towards the search engines, there are some additional options at your disposal. We (the business units) had contacts with AdSense etc., which would come in handy.