Hacker News new | ask | show | jobs
by SinjonSuarez 847 days ago
Check out the cloudscraper library if are having speed/cpu issues with sites that require js/have cloudfare defending them. That plus a proxy list plus threading allows me to make 300 requests a minute across 32 different proxies. Recently implemented it for a project: https://github.com/rezaisrad/discogs/tree/main/src/managers
2 comments

I've found myself writing the same session/proxy/rate limiting/header faking management code over and over for my scrapers. I've extracted it into it's own service that runs in docker and acts as a MITM proxy between you and target. It is client language agnostic, so you can write scrapers in python, node or whatever and still have great performance.

Highly recommend this approach, it allows you to separate infrastructure code, that gets highly complex as you need more requests, from actual spider/parser code that is usually pretty straightforward and project specific.

https://github.com/jkelin/forward-proxy-manager

This is great, was totally in the back of my mind as a next step.
Nicely written scraper, btw. Good code.
appreciate that! as a few mentioned here, there’s a lot of useful scraping tools/libraries to leverage these days. headless selenium no longer seems to make sense to me for most use cases