| I run the division at my company that builds crawlers for websites with public records. We scrape this information on demand when a case is requested, and we handle an enormous volume of different sites (or sources, as we call them). We recently passed 700 total custom scrapers. Lately, we have seen a spike in sites that detect and block our crawlers with some sort of JavaScript we cannot identify. We use headless Chrome and Selenium to build out most of our integrations, and I'm starting to wonder if the science of blocking scraping is getting more popular... I don't think what I'm doing is subversive at all: we're running background checks on people, and we can reduce business costs by replacing error-prone researchers with smart scrapers that run all day. I don't want to seem like the bad guy here, but what if I wanted to do the opposite of this research? Where do I start? Study the Chromium source? Can anyone recommend a few papers? |
I'm curious why you'd jump straight to browser detection as the most likely culprit. When I was doing scraping, the far more common case was bot detection by origin and access patterns. It's just very difficult to make an automated scraper look like a residential or business user.
Where do you run your scraping operation? If it's in AWS or some other hosting provider, that alone will get you blocked quickly by a lot of sites. Do you rate limit, including adding random jitter to mimic the way a human might use a browser?
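To make the jitter point concrete, here is a rough sketch of what I mean, not a drop-in for anyone's stack. The names `jittered_delay` and `polite_fetch_all` are made up for illustration, `fetch` stands in for whatever function actually loads a page, and the 5-second base with plus or minus 50% spread is an arbitrary assumption you would tune per source.

```python
import random
import time


def jittered_delay(base_seconds, jitter_fraction=0.5):
    """Return a delay drawn uniformly from base*(1 - j) to base*(1 + j)."""
    if base_seconds < 0 or not 0 <= jitter_fraction <= 1:
        raise ValueError("base_seconds must be >= 0 and jitter_fraction in [0, 1]")
    low = base_seconds * (1 - jitter_fraction)
    high = base_seconds * (1 + jitter_fraction)
    return random.uniform(low, high)


def polite_fetch_all(urls, fetch, base_seconds=5.0, jitter_fraction=0.5,
                     sleep=time.sleep):
    """Fetch each URL in order, pausing a randomized interval between requests.

    `fetch` is a hypothetical callable (URL in, page out); `sleep` is
    injectable so the pacing can be tested without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, a jittered pause before the rest.
            sleep(jittered_delay(base_seconds, jitter_fraction))
        results.append(fetch(url))
    return results
```

The point is only that the gaps between requests stop being a fixed interval. A uniform spread is the simplest choice; it still doesn't look much like a real person, who pauses longer on some pages than others.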
There are scraping services available that essentially use a network of browsers on residential connections, with their extension installed, to get around scraping detection. It's much slower, but it's much more reliable. We also had some success by signing up with a bunch of the VPN providers (PIA, NordVPN, ExpressVPN, etc.) and cycling through their servers frequently. Anything to avoid creating patterns that look automated or being tied to an IP that can be blacklisted. I'd start there before I'd worry about hacky JavaScript detection, like the kind in this story, being what's tripping you up.
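The cycling part can be sketched roughly like this. It is an illustration under assumptions, not what we actually ran: `EndpointRotator` is a hypothetical helper, the endpoint strings are placeholders for whatever proxy or VPN exit addresses you have, and the 20-requests-per-endpoint limit is an arbitrary number.

```python
import random


class EndpointRotator:
    """Hand out an outbound endpoint, switching after a fixed number of uses."""

    def __init__(self, endpoints, max_uses=20, rng=None):
        if not endpoints:
            raise ValueError("need at least one endpoint")
        self._pool = list(endpoints)
        self._max_uses = max_uses
        self._rng = rng or random.Random()
        self._current = None
        self._uses = 0

    def current(self):
        """Return the endpoint to use for the next request."""
        if self._current is None or self._uses >= self._max_uses:
            self._switch()
        self._uses += 1
        return self._current

    def mark_blocked(self, endpoint):
        """Retire an endpoint that a site has started refusing."""
        if endpoint in self._pool:
            self._pool.remove(endpoint)
        if not self._pool:
            raise RuntimeError("no endpoints left in the pool")
        if endpoint == self._current:
            self._current = None  # force a switch on the next call

    def _switch(self):
        # Prefer any endpoint other than the one just used, if one exists.
        candidates = [e for e in self._pool if e != self._current] or self._pool
        self._current = self._rng.choice(candidates)
        self._uses = 0
```

How the chosen endpoint gets applied depends on the client. With headless Chrome it would typically be passed as a `--proxy-server=` argument at launch, which means a new browser session per switch.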