
I believe Scrapy has somewhat intelligent cache control options. Maybe that could be recreated in a few dozen lines of code, maybe a few hundred. But there are a huge number of features like this; it’s basically a Swiss Army knife.
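For the cache part, here is a rough sketch of what switching it on looks like. As far as I know these are Scrapy's HTTP cache settings, but the dict name `CACHE_SETTINGS` and the directory name are just my choices for illustration; you would normally put these keys in `settings.py` or in a spider's `custom_settings`.

```python
# Sketch of Scrapy's HTTP cache settings, collected in a plain dict.
# CACHE_SETTINGS is a hypothetical name used only for this example.
CACHE_SETTINGS = {
    # Turn on the HTTP cache downloader middleware.
    "HTTPCACHE_ENABLED": True,
    # Respect Cache-Control style headers instead of caching everything.
    "HTTPCACHE_POLICY": "scrapy.extensions.httpcache.RFC2616Policy",
    # Directory for cached responses (assumed name, pick your own).
    "HTTPCACHE_DIR": "httpcache",
}

if __name__ == "__main__":
    for key, value in CACHE_SETTINGS.items():
        print(f"{key} = {value!r}")
```

Assigning the dict to `custom_settings` on a spider class should apply it to that spider only, which is handy while iterating on a single scraper.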

Examples include rotating proxies and rotating user agent headers, hooks to add middleware to the processing pipeline, CLI switches to change your data output format, and nice debugging and logging.
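To give a feel for the middleware hooks, here is a minimal sketch of a user agent rotator. It assumes the usual downloader middleware shape (`process_request(request, spider)` returning `None` to let the request continue); the class name, the `USER_AGENTS` setting name, and the `rng` parameter are all made up for this example and are not part of Scrapy.

```python
import random


class RotateUserAgentMiddleware:
    """Picks a User-Agent header at random for each outgoing request."""

    def __init__(self, user_agents, rng=None):
        if not user_agents:
            raise ValueError("need at least one user agent string")
        self.user_agents = list(user_agents)
        # Injectable RNG so the behaviour can be made deterministic in tests.
        self.rng = rng or random.Random()

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENTS is a hypothetical custom setting, not a built-in one.
        return cls(crawler.settings.getlist("USER_AGENTS"))

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets processing continue.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None
```

You would then register it under `DOWNLOADER_MIDDLEWARES` in your settings with a priority number, using whatever module path your project has. The class does not import Scrapy itself, so you can poke at it with a fake request object before wiring it in.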

Other larger-scale features include distributed crawlers, scheduling, and a monitoring web UI so you can see progress.

It’s what I reach for first, because you can be up and running with your first scraper in an hour. By hand, that’s maybe 10 minutes, but if you want to iterate, and your first scraper is a v1 rather than a final effort… I think it’s definitely worth it.