| The more things change, the more they stay the same. About 10-15 years ago, the scourge I was fighting was social media monitoring services, companies paid by big brands to watch sentiment across forums and other online communities. I was running a very popular and completely free (and ad-free) discussion forum in my spare time, and their scraping was irritating for two reasons. First, they were monetising my community when I wasn’t. Second, their crawlers would hit the servers as hard as they could, creating real load issues. I kept having to beg our hosting sponsor for more capacity. Once I figured out what was happening, I blocked their user agent. Within a week they were scraping with a generic one. I blocked their IP range; a week later they were back on a different range. So I built a filter that would pseudo-randomly[0] inject company names[1] into forum posts. Then any time I re-identified[2] their bot, I enabled that filter for their requests. The scraping stopped within two days and never came back. -- [0] Random but deterministic based on post ID, so the injected text stayed consistent. [1] I collated a list of around 100 major consumer brands, plus every company name the monitoring services proudly listed as clients on their own websites. [2] This was back around 2009, so things weren't nearly as sophisticated as they are today, both in terms of bots and anti-bot strategies. One of the most effective tools I remember deploying back then was analysis of all HTTP headers. Bots would spoof a browser UA, but almost none would get the full header set right: things like Accept-Encoding or Accept-Language were either absent or static strings that didn't exactly match what the real browser would ever send. |
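For anyone wondering what the header analysis in [2] might look like, here is a rough sketch in Python. It is not the original filter, just one guess at the idea: flag requests whose User-Agent claims to be a browser but which are missing headers a real browser normally sends. The function name `looks_like_spoofed_browser` and the `EXPECTED_HEADERS` list are made up for illustration, and the "static string that doesn't match the real browser" check is left out because it needs per-browser reference data.

```python
from typing import Mapping

# Hypothetical rule set: headers a mainstream browser normally sends
# on an ordinary page request. Tune this against real traffic.
EXPECTED_HEADERS = ("accept", "accept-encoding", "accept-language")


def looks_like_spoofed_browser(headers: Mapping[str, str]) -> bool:
    """Return True if the request claims to be a browser but lacks
    one or more headers that browsers normally include."""
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}

    user_agent = normalized.get("user-agent", "")
    if "mozilla" not in user_agent.lower():
        # Not claiming to be a browser; leave it to other checks.
        return False

    # A missing or empty expected header is treated as suspicious.
    return any(not normalized.get(name) for name in EXPECTED_HEADERS)
```

A request carrying only `User-Agent: Mozilla/5.0` would be flagged, while one that also sends plausible `Accept`, `Accept-Encoding`, and `Accept-Language` values would pass. In practice this is probably best treated as one signal among several rather than a verdict on its own.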
I like how this kind of response is very difficult for them to detect when I turn it on, and as a bonus, it pollutes their data. They stopped trying a few days after that.
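The "random but deterministic" injection from [0] could be sketched roughly like this in Python. This is only an illustration under assumptions, not the original code: the brand names are placeholders, `inject_brands` is a hypothetical helper, and seeding a generator with the post ID is just one way to get output that stays the same on every request for the same post.

```python
import random

# Placeholder names; the original list was around 100 real consumer
# brands plus the monitoring services' own advertised clients.
BRANDS = ["ExampleCola", "AcmePhones", "GlobexAir"]


def inject_brands(post_id: int, text: str, brands=BRANDS, count: int = 2) -> str:
    """Insert `count` brand names at pseudo-random word positions.

    The generator is seeded with the post ID, so the same post always
    gets the same injected text, which makes the pollution harder to
    spot by re-fetching the page.
    """
    rng = random.Random(post_id)  # per-post seed gives repeatable output
    words = text.split()  # note: this collapses the original whitespace
    for _ in range(count):
        position = rng.randint(0, len(words))  # inclusive; len(words) appends
        words.insert(position, rng.choice(brands))
    return " ".join(words)
```

Calling `inject_brands(42, "I really like this new phone")` twice returns the same polluted string both times, while a different post ID generally yields different insertions. The filter would only be switched on for requests already identified as the scraper, so regular readers never see the altered text.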