by joe_91 1615 days ago
I'm scraping about 30 sites for work at the moment, but a few of them use Cloudflare, which has been a b*tch to deal with. I've tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.
6 comments

This has a lot of good info on how Cloudflare and others work, and more creative ways to bypass them if the easier options don't work: https://incolumitas.com/2021/05/20/avoid-puppeteer-and-playw...
I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.
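For anyone trying to tell a Cloudflare block apart from an ordinary failure (in a feed reader or a scraper), here is a rough heuristic sketch. It assumes the response headers are available as a plain dict; the function name `looks_like_cloudflare_block` is made up for illustration. It relies on the `cf-mitigated: challenge` header that Cloudflare sets on challenge pages, plus a weaker fallback on status code and the `Server`/`CF-RAY` headers, so it can misclassify a genuine 403 from an origin that merely sits behind Cloudflare.

```python
def looks_like_cloudflare_block(status, headers):
    """Heuristically guess whether a response is a Cloudflare block or challenge.

    `status` is the HTTP status code and `headers` a dict of response headers.
    This is a guess based on observable signals, not an official check.
    """
    # Header names are case-insensitive, so normalise them first.
    h = {k.lower(): str(v).lower() for k, v in headers.items()}

    # Strong signal: Cloudflare marks challenge pages with this header.
    if h.get("cf-mitigated") == "challenge":
        return True

    # Weaker signal: an error status on a response served through Cloudflare.
    served_by_cloudflare = h.get("server") == "cloudflare" or "cf-ray" in h
    return served_by_cloudflare and status in (403, 429, 503)
```

Logging this per request gives a per-site block rate, which is one way to put a number on "reliability is patchy" when comparing proxy providers.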
> optimised headless browser with good proxies instead

Are you saying you only had problems because you weren't using a headless browser before, and that a headless browser plus proxies is now generally enough to avoid being flagged as a scraper?

I think it will eventually go the way of stock trading: if you have a good strategy, you don't want to share it with the world, because that would render it useless.
Is the “pretty optimized headless browser” an off-the-shelf thing, or something custom? Are you using Playwright/Puppeteer to drive it?
Headless Chrome [0] and alpine-chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks, and other proxy or VPN tools.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome
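As a rough sketch of how a headless Chrome like the ones above can be pointed at a proxy, the snippet below only builds the command line; it does not launch anything. The binary name `chromium-browser`, the proxy URL, and the helper name `headless_chrome_args` are assumptions for illustration, not something taken from either image's documentation. The switches themselves (`--headless`, `--proxy-server`, `--remote-debugging-port`, `--user-agent`, `--no-sandbox`, `--disable-gpu`) are standard Chromium flags.

```python
import shlex


def headless_chrome_args(proxy_url, binary="chromium-browser",
                         debug_port=9222, user_agent=None):
    """Build an argv list for starting headless Chrome behind a proxy.

    `binary` is an assumed executable name; adjust it for your image or OS.
    """
    args = [
        binary,
        "--headless",
        "--disable-gpu",
        # Often needed when Chrome runs as root in a container;
        # it disables the sandbox, so only use it where that is acceptable.
        "--no-sandbox",
        # Route all browser traffic through the given proxy.
        f"--proxy-server={proxy_url}",
        # Expose the DevTools protocol so a driver can attach.
        f"--remote-debugging-port={debug_port}",
    ]
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def as_shell_command(args):
    """Render the argv list as a single shell-safe string."""
    return shlex.join(args)
```

The resulting list can be passed to `subprocess.Popen`, and a driver such as Puppeteer or Playwright can then attach over the debugging port. Keeping the argument construction separate makes it easy to rotate proxies per browser instance.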

Do you have any recommendations for the "good proxies" you mentioned?