Hey. I run a small community forum and I've been dealing with this exact kind of behaviour, where well over 99% of requests come from bad crawlers. There used to be plenty of "tells" for the faked browsers, HTTP/1.1 being a huge one. As you said, however, they're getting a bit smarter about that and it's becoming increasingly difficult to differentiate them from legitimate traffic.
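
To make the HTTP/1.1 tell concrete, here's a rough sketch of the idea in Python. This is not my actual rule set: the function name, the regex, and the version cutoff of 100 are all made up for illustration. The assumption is that a client claiming to be a recent Chrome or Firefox would normally negotiate HTTP/2 or HTTP/3, so the same claim arriving over HTTP/1.x is worth a second look.

```python
import re

# Hypothetical pattern: pulls the major version out of a Chrome/Firefox UA.
MODERN_BROWSER = re.compile(r"(Chrome|Firefox)/(\d+)")

# Assumed cutoff for "recent" browser versions; tune to taste.
MIN_MODERN_MAJOR = 100


def looks_like_faked_browser(user_agent: str, http_version: str) -> bool:
    """Return True when a UA claims a recent browser but used HTTP/1.x."""
    match = MODERN_BROWSER.search(user_agent)
    if match is None:
        # Not claiming to be a browser this check knows about.
        return False
    claims_modern = int(match.group(2)) >= MIN_MODERN_MAJOR
    used_http1 = http_version.upper().startswith("HTTP/1.")
    return claims_modern and used_http1


ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
print(looks_like_faked_browser(ua, "HTTP/1.1"))
print(looks_like_faked_browser(ua, "HTTP/2"))
```

On its own this would also catch real users behind proxies that downgrade to HTTP/1.1, so at best it's one signal for deciding who gets a challenge, not a reason to block outright.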

It's been getting worse over the past year, with the past few weeks in particular seeing a massive change literally overnight. I had to aggressively tune my WAF rules to get things even remotely under control. With Cloudflare I'm issuing browser challenges to any client that looks at all suspicious, and the pass rate is currently below 0.5%. For my users' sake, a successful browser challenge is "valid" for over a month, but this still feels like another thing that'll eventually be bypassed.

I'd be keen to know if you've found any other effective ways of mitigating these most recent aggressive scraping requests. Even a simple "yes" or "no" would be appreciated; I think it's fair to be apprehensive about sharing specific details publicly, since even a lot of folks here on HN seem to think it's their right to scrape content at orders of magnitude higher throughput than all users combined.

I really don't know how this is sustainable long-term. It's eaten up quite a lot of my personal time and effort just for the sake of a hobby that I otherwise greatly enjoy.