Right, you got it. Sometimes I feel like we're being attacked by bespoke systems, but it really must be off-the-shelf stuff since our full content is easily licensable. We shouldn't be worth the trouble.
We just weren't getting enough information from Bing or Google Analytics or CloudFlare, and when I developed a real-time activity dashboard, patterns started emerging: distributed web scrapers, registration bots, vulnerability scans, and some of these in tandem (e.g., scans commencing immediately after we blocked a range of addresses). Many of these are coming from cloud hosts, Azure being the worst, with Google a close second. This is the type of traffic they don't want you to see, so those respective analytics services just suppress it, because it would be negative advertising if we could actually see what is happening in real time. I compared the numbers - Google was consistently underreporting our traffic by at least 40%, and a lot (not the majority, but enough to be noticeable) of that traffic was coming from their own hosted servers (not the indexing bots, but the user cloud instances).
CloudFlare implements temporary bans but I needed something permanent for those threats that were recognizable based on their request patterns.
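To give a rough idea of what a permanent netblock ban can look like, here is a minimal sketch using Python's standard `ipaddress` module. It is not our actual rule set: the `PERMANENT_BLOCKS` list and the `is_permanently_blocked` name are made up for illustration, and the ranges are the reserved documentation netblocks, not real offenders.

```python
import ipaddress

# Hypothetical permanent blocklist. These are reserved documentation ranges,
# used here only as placeholders for netblocks identified by request pattern.
PERMANENT_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_permanently_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any permanently banned netblock."""
    addr = ipaddress.ip_address(client_ip)
    # Only compare addresses and networks of the same IP version.
    return any(
        addr.version == net.version and addr in net
        for net in PERMANENT_BLOCKS
    )
```

A linear scan like this is fine for a short list; a large blocklist would probably want a prefix tree or a firewall-level rule instead.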
The ARIN squatting is the latest thing I'm seeing - a lot of requests coming from netblocks that are former DoD and RedHat addresses. The publicly available ARIN databases aren't entirely up-to-date and the bad guys know it, so some of the checks we depend on have to be taken with a grain of salt.
So far, I've been able to develop business rules to separate out the human activity from the carefully constructed scraper/probe attempts, but I fear that if they get just a bit more sophisticated I may lose that ability.
That is a fair point. It used to be that way, but around 2016 we started noticing bots and scrapers that use full WebKit implementations to run JavaScript. Those clients should be triggering Google Analytics just like a desktop browser, but it was difficult to make the correlation due to information hiding in the GA dashboard (a tuple of IP address, timestamp, and resource would be needed, but they do not provide that, so it was impossible to test).
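For what it's worth, this is roughly the test I would have run if both sides exposed that tuple. It is purely a sketch: the `unreported` and `bucket` names and the input shape (lists of `(ip, iso_timestamp, resource)` tuples) are assumptions, and GA does not export data in this form. Truncating to the minute is a crude way to tolerate small timing differences between the two sources.

```python
from datetime import datetime


def bucket(ts: str) -> str:
    """Truncate an ISO 8601 timestamp to the minute."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")


def unreported(server_hits, analytics_hits):
    """Return server-side hits with no matching (ip, minute, resource) analytics record."""
    seen = {(ip, bucket(ts), res) for ip, ts, res in analytics_hits}
    return [
        (ip, ts, res)
        for ip, ts, res in server_hits
        if (ip, bucket(ts), res) not in seen
    ]
```

Dividing the length of the result by the number of server-side hits would give the underreporting fraction for whatever window the two lists cover.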