Hacker News
by jay-anderson 1341 days ago
We have bots scraping some of our pages at work. We've attempted to reach out but haven't received a response. We don't mind the bots themselves so much, but we want them to be well behaved. Currently they are making calls over and over again that return a 4xx response, and those calls are a significant portion of our traffic. We want to request that they stop making bad requests and slow down (we do have throttling in place, but this just gave them more errors to ignore and retry).

I'd love to have an open third party like this one. It'd even help with prioritizing features that we're missing in our first-party products.

2 comments

> but this just gave them more errors to ignore and retry.

So null-route the offending IPs on a [0]24-hour timeout? The problem you're describing isn't "scraping", it's "low-grade denial-of-service attack (that you suspect might be a result of attempted scraping)", and should be addressed accordingly. (The parenthesised part doesn't really matter.)

0: exponentially increasing up to -, for automated versions, but you're presumably already familiar with the current batch of offending source addresses.
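
A rough sketch of what an automated, exponentially increasing timeout could look like. Doubling per offence is just one assumed reading of "exponentially", and `ban_duration`, `first` and `cap` are hypothetical names; the comment doesn't say what the starting value or the upper limit should be, so both are left for the caller to supply.

```python
from datetime import timedelta


def ban_duration(offences, first, cap):
    """Length of the null-route for a source address's nth offence.

    Assumption: the ban doubles on every repeat offence and never exceeds
    `cap`. Both `first` and `cap` are caller-supplied; neither value comes
    from the thread.
    """
    if offences < 1:
        raise ValueError("offences must be at least 1")
    duration = min(first, cap)
    for _ in range(offences - 1):
        if duration >= cap:
            break  # already at the ceiling, no point doubling further
        duration = min(duration * 2, cap)
    return duration
```

You'd keep a per-address offence counter somewhere and bump it each time the address comes back and misbehaves after its ban expires.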

[Too late to edit:]

Also, double check that your first-stage throttling actually increases the latency of the requests, such that a user-agent that doesn't issue multiple requests concurrently (but starts a new request immediately on receiving a response) will automatically self-rate-limit. This should be standard for any 'serious' HTTP server, but I've seen a few that incorrectly go straight from "serve 200 OK instantly" to "serve 429 Too Many Requests, also instantly". What they should do is "serve 200 OK after ~1 second", and send 429 only when there are actually too many requests (in particular, more than one at any given time).
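
To make that concrete, here's a minimal sketch of the idea, not any particular server's implementation. `DelayingThrottle`, `handle`, `serve` and `min_interval` are all made-up names, and the 1-second default just mirrors the "~1 second" above. A sequential client is slowed down by the delay and still gets 200; only a client with a second request in flight gets 429.

```python
import threading
import time


class DelayingThrottle:
    """Per-client throttle that slows responses before it rejects them."""

    def __init__(self, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
        self.min_interval = min_interval  # assumed spacing between responses, seconds
        self._sleep = sleep               # injectable so the delay can be faked in tests
        self._clock = clock
        self._lock = threading.Lock()
        self._in_flight = set()           # clients with a request being served right now
        self._last_done = {}              # client -> time its previous response finished

    def handle(self, client, serve):
        """Return (status, body); `serve` is a hypothetical callable producing the body."""
        with self._lock:
            if client in self._in_flight:
                # More than one request at a time: the only case that earns a 429.
                return 429, "Too Many Requests"
            self._in_flight.add(client)
            last = self._last_done.get(client)
        try:
            if last is not None:
                wait = self.min_interval - (self._clock() - last)
                if wait > 0:
                    # Hold the response back instead of erroring; a client that
                    # waits for each reply is rate-limited by this alone.
                    self._sleep(wait)
            return 200, serve()
        finally:
            with self._lock:
                self._in_flight.discard(client)
                self._last_done[client] = self._clock()
```

In a real server you'd want a non-blocking delay rather than a sleeping worker thread, and you'd key `client` on whatever identifies the scraper (IP, API key, etc.).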

Isn't this where you put a stop to existing requests and implement free API keys?