by marginalia_nu 1694 days ago
Bots are one of those things that are easy to build and hard to get right, and there's really no way of preparing for the chaotic reality of real web pages other than fixing the problems as they show up. Weird and unexpected interactions are going to happen. Crawling the real web involves navigating a fractal of unexpected, undocumented and non-standard corner cases. Nobody gets that right on the first try. Because of that I do think we need to be a bit patient with bots.

At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.


I kinda feel like rate limiting your requests to individual domains and IP addresses is an easy thing that goes a long way towards getting it right.
There are still snags with that.

Stuff like redirect resolution is very easy to overlook. You may think you're fetching 1 URL per second, but if you are using the wrong tool and you're on a server that has you bouncing around like in a pinball machine and takes you through a dozen redirects for every request, the reality may be closer to 10 requests per second.
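A rough sketch of one way to keep the count honest, in Python: follow redirects by hand and charge every hop against the rate limit, instead of letting the HTTP client resolve the chain silently. Everything named here is hypothetical and not from any particular library: `fetch_once(url)` is assumed to return `(status, location)` without following redirects, `throttle(url)` is assumed to block until a request to that URL is allowed, and the `MAX_HOPS` cap of 5 is an arbitrary choice.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}
MAX_HOPS = 5  # assumed cap on redirects followed, tune to taste


def fetch_following_redirects(url, fetch_once, throttle, max_hops=MAX_HOPS):
    """Follow redirects manually so that every hop is throttled and counted.

    fetch_once(url) -> (status, location_or_None), must NOT follow redirects.
    throttle(url) blocks until a request to url is allowed (hypothetical hook).
    Returns (final_url, final_status, requests_made).
    """
    requests_made = 0
    for _ in range(max_hops + 1):
        throttle(url)  # each hop is a real request, so each hop waits its turn
        status, location = fetch_once(url)
        requests_made += 1
        if status not in REDIRECT_CODES or not location:
            # Only here, at the end of the chain, is the real status known.
            return url, status, requests_made
        url = urljoin(url, location)  # Location may be relative
    raise RuntimeError(f"gave up after {requests_made} requests")
```

With a client that follows redirects on its own, `requests_made` is the number that quietly ends up larger than the one you thought you were rate limiting.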

On top of that, sometimes the same server has multiple domains. Sometimes the same IP address serves a large number of sites (maybe it's a CDN).
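To make that concrete, here is a minimal Python sketch of a limiter keyed on both the hostname and the IP it resolves to, so two domains on one server share a budget. The class and function names are made up for illustration, and `resolve` is a hypothetical hook (in a real crawler something like `socket.gethostbyname` could fill that role). Note that this would also throttle unrelated sites behind a shared CDN address, which may or may not be what you want.

```python
import time
from urllib.parse import urlsplit


class KeyedRateLimiter:
    """Allow at most one request per min_interval seconds for each key."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # key -> earliest time the next request may go out

    def wait(self, keys):
        """Block until every key is free again; return the seconds waited."""
        now = self.clock()
        ready_at = max([now] + [self.next_allowed.get(k, now) for k in keys])
        if ready_at > now:
            self.sleep(ready_at - now)
        for k in keys:
            self.next_allowed[k] = ready_at + self.min_interval
        return ready_at - now


def throttle_keys(url, resolve):
    """Build the keys for a URL: its hostname and the IP that hostname resolves to."""
    host = urlsplit(url).hostname
    return [("host", host), ("ip", resolve(host))]
```

Calling `limiter.wait(throttle_keys(url, resolve))` before each request means a second domain on an already-busy IP waits, while a domain on a different IP goes straight through.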

If you build your site in a way that multiplies each request 10x, well then that's what you get. Don't do that and you won't have issues with requests. Or handle those requests properly. There are solutions to that. You know how many requests your local Google CDN gets? They know how to manage load.
Most pages have at least an http->https redirect, and many contain a lot of old links to http content.

Usually it's error pages that really drive the large redirect chains. They often have the vibe of some forgotten stopgap put in place to help with a migration to a version of the site that no longer exists.

Of course you don't know it's an error page until you reach the end of the redirect chain.