| > Where do I start? Study the Chromium source? I'm curious why you'd jump straight to browser detection as the most likely culprit. When I was doing scraping, the far more common case was bot detection by origin and access patterns. It's just very difficult to make an automated scraper look like a residential or business user. Where do you run your scraping operation? Is it in AWS or some other hosting provider? That will get you blocked quickly by a lot of sites. Do you rate limit, including adding random jitter to mimic the way a human might use a browser? There are scraping services available that essentially use a network of browsers on residential connections with their extension installed to get around scraping detection. It's much slower, but it's much more reliable. We also had some success by signing up with a bunch of the VPN providers (PIA, NordVPN, ExpressVPN, etc.) and cycling through their servers frequently. Anything to avoid creating patterns that look automated or being tied to an IP that can be blacklisted. I'd start there before I'd worry about hacky JavaScript detection like in this story being what's tripping you up. |
We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.
Some sites are just so damn good at detecting us and I just don't get it.
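
For anyone wondering what "rate limit with random jitter" plus per-site tuning means in practice, here's a minimal sketch. To be clear, this is not our actual library: `SiteProfile`, `PROFILES`, `next_delay`, the hostname, and all the numbers are made up for illustration.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class SiteProfile:
    # Hypothetical per-site pacing settings; the values are illustrative only.
    base_delay: float = 5.0  # average seconds between requests
    jitter: float = 0.5      # fraction of base_delay to randomize by


# Hypothetical per-site overrides, keyed by hostname.
PROFILES = {
    "example.com": SiteProfile(base_delay=12.0, jitter=0.8),
}
DEFAULT_PROFILE = SiteProfile()


def next_delay(host, rng=random):
    """Return a randomized delay in seconds for the given host."""
    profile = PROFILES.get(host, DEFAULT_PROFILE)
    low = profile.base_delay * (1.0 - profile.jitter)
    high = profile.base_delay * (1.0 + profile.jitter)
    return rng.uniform(low, high)


def polite_sleep(host):
    """Block for a jittered, per-site delay before the next request."""
    time.sleep(next_delay(host))
```

You'd call `polite_sleep("example.com")` between requests. Uniform jitter like this only breaks up fixed intervals; it probably won't fool a site that models access patterns more carefully, which is kind of my whole complaint.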