


	by SinjonSuarez 847 days ago
	Check out the cloudscraper library if you are having speed/CPU issues with sites that require JS or have Cloudflare defending them. That plus a proxy list plus threading allows me to make 300 requests a minute across 32 different proxies. Recently implemented it for a project: https://github.com/rezaisrad/discogs/tree/main/src/managers
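
For readers who want to see the shape of that setup, here is a minimal sketch of cloudscraper plus a proxy list plus threading. It is not the code from the linked repository. The names `RateLimiter`, `cloudscraper_fetch` and `fetch_all` are made up for illustration, and the defaults of 300 requests per minute and 32 workers only mirror the numbers quoted above. The fetch function is passed in as a parameter so the rotation and rate limiting can be tried without network access.

```python
import itertools
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hypothetical helper: space calls evenly so at most `per_minute` start per minute."""

    def __init__(self, per_minute):
        self.interval = 60.0 / per_minute
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        time.sleep(max(0.0, slot - now))


def cloudscraper_fetch(url, proxy):
    """Fetch one URL through one proxy with cloudscraper (third-party package)."""
    import cloudscraper  # imported lazily; install with `pip install cloudscraper`

    scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
    response = scraper.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    return response.status_code


def fetch_all(urls, proxies, fetch=cloudscraper_fetch, per_minute=300, max_workers=32):
    """Fetch every URL, rotating proxies round-robin and capping the request rate."""
    limiter = RateLimiter(per_minute)
    # Pair each URL with a proxy up front so worker threads share no iterator.
    jobs = list(zip(urls, itertools.cycle(proxies)))

    def worker(job):
        url, proxy = job
        limiter.wait()
        try:
            return url, proxy, fetch(url, proxy)
        except Exception as exc:  # one bad proxy should not abort the whole batch
            return url, proxy, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))
```

Calling `fetch_all(urls, proxies)` returns one `(url, proxy, result)` tuple per URL in input order, where `result` is either the status code or the exception that was raised. A real scraper would likely also retry failures on a different proxy.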

2 comments

fireant 846 days ago

I've found myself writing the same session/proxy/rate limiting/header faking management code over and over for my scrapers. I've extracted it into its own service that runs in Docker and acts as a MITM proxy between you and the target. It is client-language agnostic, so you can write scrapers in Python, Node or whatever and still have great performance.

Highly recommend this approach: it lets you separate infrastructure code, which gets highly complex as you need more requests, from the actual spider/parser code, which is usually pretty straightforward and project-specific.

https://github.com/jkelin/forward-proxy-manager
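
To illustrate the separation in general terms, here is a rough sketch of a scraper whose only infrastructure concern is "send everything through a local proxy", with the parsing kept as a plain function. It does not show forward-proxy-manager's actual interface. The address in `PROXY_URL` is hypothetical, and treating the service as a standard HTTP proxy is an assumption made for the example. The names `make_opener`, `parse_title` and `scrape` are likewise made up.

```python
import urllib.request
from html.parser import HTMLParser

# Hypothetical address of a proxy service running locally in Docker.
PROXY_URL = "http://localhost:8080"


def make_opener(proxy_url=PROXY_URL):
    """Infrastructure side: route all requests through the proxy service."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


class TitleParser(HTMLParser):
    """Parser side: collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(html):
    """Project-specific parsing, with no knowledge of proxies or sessions."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def scrape(url, opener):
    """Fetch a page through the given opener and hand the HTML to the parser."""
    with opener.open(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_title(html)
```

With this split, `scrape("http://example.com/", make_opener())` is all the spider needs, while sessions, proxy rotation, rate limiting and header faking stay inside the proxy service. HTTPS targets behind a MITM proxy would additionally require the client to trust the proxy's certificate, which this sketch leaves out.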

SinjonSuarez 846 days ago

This is great, was totally in the back of my mind as a next step.

mndgs 847 days ago

Nicely written scraper, btw. Good code.

SinjonSuarez 847 days ago

Appreciate that! As a few mentioned here, there are a lot of useful scraping tools/libraries to leverage these days. Headless Selenium no longer seems to make sense to me for most use cases.