Hacker News
by fareesh 1615 days ago
My toolbox of choice for web scraping is either Nokogiri or Puppeteer.

Can someone sell me on Beautiful Soup or Scrapy or any of the others? Do they provide any advantages or features that I'd be missing out on?

One great Scrapy feature is caching the page content. So you can essentially write a crawler, and while that’s running, write your extraction code. Then, if you want to go back, you can add more extractors and run them against your local copy.
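
For reference, a minimal sketch of the settings that turn this on, as they might appear in a project's settings.py. The setting names are Scrapy's built-in HTTP cache options; the values are illustrative assumptions, not recommendations.

```python
# settings.py (values below are illustrative assumptions)

# Store every downloaded response on disk and serve repeat requests from it.
HTTPCACHE_ENABLED = True

# Directory for cached responses, relative to the project's data directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire, so re-runs do not hit the network.
HTTPCACHE_EXPIRATION_SECS = 0

# Do not cache these error responses, so they are fetched again next run.
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```

With something like this in place, the first `scrapy crawl` run downloads the pages and later runs replay them from disk, so new extraction code can be tried against the same responses.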
Ah interesting, I end up doing this manually, i.e. File.write followed by what I want to scrape
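
The manual version of that idea might look roughly like this in Python, using only the standard library. The names `fetch_cached`, `download`, and `CACHE_DIR` are hypothetical, and the sketch assumes one file per URL with no expiry.

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path("page_cache")  # hypothetical location for saved pages


def download(url: str) -> bytes:
    """Fetch a page over the network (no retries or custom headers here)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url: str, fetch=download, cache_dir: Path = CACHE_DIR) -> bytes:
    """Return the page body, downloading it only if no local copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Extraction code can then call `fetch_cached(url)` repeatedly while it is being written; only the first call for each URL touches the network.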
I believe Scrapy has somewhat intelligent cache control options. Maybe it could be recreated in a few dozen lines of code, maybe a few hundred. But there are a huge number of these types of features; it’s basically a Swiss Army knife.

Examples include rotating proxies, rotating user agent headers, hooks to add in middleware for processing pipelines, CLI switches to change your data output format, and nice debugging and logging.
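
To give a feel for the middleware hooks, here is a rough sketch of a downloader middleware that rotates the User-Agent header. The class name and the user agent strings are made up for illustration; `process_request` is the hook name Scrapy's downloader middleware interface uses, and the sketch assumes the request object exposes a dict-like `headers` attribute.

```python
import random

# Hypothetical pool; a real project would supply current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Called for each outgoing request before it is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets the request continue through the normal flow.
        return None
```

It would be switched on through the `DOWNLOADER_MIDDLEWARES` setting, for example by mapping a module path such as `myproject.middlewares.RotateUserAgentMiddleware` (a hypothetical path) to a priority number.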

Other large-scale features include distributed crawlers, scheduling, and a monitoring web UI so you can see progress.

It’s what I reach for first, because you can be up and running with your first scraper in an hour. By hand, that’s maybe 10 minutes, but if you want to iterate, and your first scraper is a v1 rather than a final effort, I think it’s definitely worth it.