I used to scrape back in the day when it was easy (literally just make a request and parse the HTML). It seems Cloudflare checkboxes / human verification are very commonplace nowadays. Curious how (or if) web scrapers get around those?
1. Clicking the box programmatically – possible but inconsistent
2. Outsourcing the task to one of the many CAPTCHA-solving services (2Captcha etc) – better
3. Using a pool of reliable IP addresses so you don't encounter checkboxes or turnstiles – best
I run a web scraping startup (https://simplescraper.io) and this is usually the approach[0]. It has become more difficult, and I think a lot of the AI crawlers are peeing in the pool with aggressive scraping, which is making the web a little bit worse for everyone.
[0] Worth mentioning that once you're "in" past the captcha, a smart scraper will try to use fetch to access more pages on the same domain so you only need to solve a fraction of possible captchas.
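For anyone wondering what option 3 looks like in practice, here is a minimal sketch of round-robin rotation over a proxy pool using only the Python standard library. The proxy addresses, the `ProxyPool` class, and its method names are all hypothetical, made up for illustration; a real setup would also need health checks and per-site rate limits.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints (assumption): replace with addresses you control or rent.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]


class ProxyPool:
    """Hand out proxies round-robin so consecutive requests leave from different IPs."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL, wrapping around at the end of the list."""
        return next(self._cycle)

    def opener(self):
        """Build a urllib opener that routes HTTP and HTTPS through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

Usage would be something like `proxy, opener = ProxyPool(PROXIES).opener()` followed by `opener.open(url)`. Round-robin is just the simplest policy; weighting by recent success rate is probably closer to what a production scraper would want.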
First time hearing of the fetch() approach! If I understand correctly, regular browser automation typically involves making a separate GET request (a full page navigation) for each page. The fetch() strategy instead makes a GET for the first page (just as with regular browser automation), then, after satisfying Cloudflare, uses fetch(<url>) from within that page to retrieve the rest of the pages you're after, rather than navigating to each one.
This approach is less noisy and has less impact on the server, and is therefore less likely to get noticed by bot detection.
This is fascinating stuff. (I'd previously used very little JavaScript in scrapes, preferring Ruby, R, or Python, but this may tilt my tooling preferences toward using more JS.)
Almost. I mean, it's not like fetch(..) is going to lead to some esoteric kind of HTTP request method. I am guessing the parent comment is saying what it is saying because fetch will use the cookies and other crumbs set by the successful completion of the captcha. If you can take all those crumbs and include them in your next GET request, you don't need to resort to fetch.
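A rough sketch of "take the crumbs and include them in your next GET request", using only the Python standard library. The helper name `build_followup_request`, the cookie values, and the URL are hypothetical placeholders. `cf_clearance` is used here as an example name for the clearance cookie, which is an assumption about the target site. The User-Agent is copied along with the cookies on the assumption that the clearance is tied to the browser that earned it.

```python
import urllib.request


def build_followup_request(url, cookies, user_agent):
    """Build a GET request that carries cookies copied from a browser session."""
    # Serialize the cookies into a single Cookie header: "name=value; name=value".
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie_header, "User-Agent": user_agent},
        method="GET",
    )


# Placeholder values (assumption): in practice these would be exported from
# the browser session that solved the challenge.
cookies = {"cf_clearance": "PLACEHOLDER", "session": "abc123"}
req = build_followup_request(
    "https://example.com/page/2", cookies, "Mozilla/5.0 (placeholder)"
)
# Sending it would be: urllib.request.urlopen(req)
```

Whether this is enough depends on the site. If the CDN also fingerprints the TLS or HTTP/2 handshake, a plain HTTP client may still look different from the browser that passed the check, which is one reason the in-browser fetch() route can be more reliable.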
Scammers will use fingerprints from their victims' browser/IP/geolocation to try to impersonate them. You can basically buy not only stolen credentials but also the environment in which to run them "safely" from such vendors.
A low-effort baseline would be https://seleniumbase.io/, to drive a preconfigured web browser that looks relatively human to the network service. Typically it just clicks through the one-click CAPTCHAs.
If that's not good enough you'll likely have to fiddle with your own web driver and possibly a computer vision rig to manage to click through 'find the motorcycle' kind of challenges. Paying a click farm to do it for you is probably cheaper in the short run.
An important hurdle is getting reputable IPv4 addresses to do it from, if you're going to do it a lot. Having or renting a botnet could help, but might be too illegal for your use case.
Some CDNs go to the length of fingerprinting the TLS and HTTP/2 handshakes to see if you're a bot. As others have mentioned, using an automated browser tends to be the broadest solution.