| HN Mirror



	by joe_91 1615 days ago
	I'm scraping about 30 sites for work at the moment, but have a few that are using Cloudflare which has been a b*tch to deal with. Tried numerous libraries and different proxy providers, but reliability is patchy. Previous fixes like https://github.com/Anorov/cloudflare-scrape don't seem to work anymore after Cloudflare updates, so I've switched to using a pretty optimised headless browser with good proxies instead.

6 comments

Ian_Kerins 1615 days ago

This has a lot of good info on how Cloudflare and others work, and more creative ways to bypass them if the easier options don't work: https://incolumitas.com/2021/05/20/avoid-puppeteer-and-playw...


nanna 1615 days ago

I'm finding that Cloudflare is even blocking my RSS reader from requesting feeds behind their service. It's not even just scrapers at this point.
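For anyone wanting to confirm that a failed feed request was stopped at Cloudflare rather than by the origin site, here is a minimal sketch of a heuristic check. The function name `may_be_cloudflare_block` and the choice of status codes are my own assumptions; the `cf-mitigated: challenge` header is the one Cloudflare documents for challenge pages, while the `Server: cloudflare` check is only a rough signal, since that header appears on any response passing through their edge.

```python
from typing import Mapping


def may_be_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic guess at whether a response is a Cloudflare challenge or block.

    Hypothetical helper for illustration; it can give false positives,
    for example when the origin itself returns 503 through Cloudflare.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}

    # Cloudflare sets this header on responses that carry a challenge page.
    if lowered.get("cf-mitigated") == "challenge":
        return True

    # Assumption: a refusal-style status served from a Cloudflare edge
    # suggests the request was blocked or rate limited there.
    served_by_cloudflare = "cloudflare" in lowered.get("server", "")
    return served_by_cloudflare and status in (403, 429, 503)
```

A feed reader could call this on the status code and headers of each failed request and log the result, which at least separates "the feed is down" from "the request was challenged".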


nsonha 1615 days ago

> optimised headless browser with good proxies instead

Are you saying you only had problems because you weren't using a headless browser before, and that now, with both a headless browser and proxies, it's generally enough to avoid being seen as a scraper?


temp8964 1615 days ago

I think it will eventually go the way of stock trading: if you have a good strategy, you don't want to share it with the world, because sharing it will render it useless.


emptysea 1615 days ago

Is the “pretty optimized headless browser” an off the shelf thing, or something custom? Are you using playwright/puppeteer to drive it?


mycall 1615 days ago

Headless Chrome [0] and alpine-Chrome [1] are pretty popular. Some variations also include V2Ray, Shadowsocks and other VPNs.

[0] https://hub.docker.com/r/justinribeiro/chrome-headless/

[1] https://github.com/Zenika/alpine-chrome
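As a rough illustration of what these images are used for, here is a sketch that shells out to headless Chrome and returns the rendered DOM. It assumes a binary named `chromium-browser` is on the PATH (the name varies by distribution and image), and the helper names `build_dump_dom_command` and `fetch_rendered_html` are made up for this example. `--no-sandbox` is included only because it is commonly needed when Chrome runs as root inside a container; drop it where the sandbox works.

```python
import subprocess
from typing import List


def build_dump_dom_command(url: str, binary: str = "chromium-browser") -> List[str]:
    """Build the argv for a one-shot headless render (hypothetical helper)."""
    return [
        binary,
        "--headless",      # run without a visible window
        "--disable-gpu",   # avoid GPU initialisation on servers
        "--no-sandbox",    # assumption: running as root in a container
        "--dump-dom",      # print the serialised DOM to stdout and exit
        url,
    ]


def fetch_rendered_html(url: str, timeout_s: float = 30.0) -> str:
    """Run headless Chrome once and return the DOM after scripts have run."""
    result = subprocess.run(
        build_dump_dom_command(url),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling `fetch_rendered_html("https://example.com")` would start a fresh browser process per page, which is simple but slow; driving one long-lived browser with Playwright or Puppeteer is the usual next step.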


rozenmd 1615 days ago

There are plugins for Puppeteer: https://github.com/berstend/puppeteer-extra/tree/master/pack...


valar_m 1615 days ago

Do you have any recommendations for the "good proxies" you mentioned?
