by mxavier 5538 days ago
The article fails to mention this, but there are probably more reasons why this is a good idea beyond JS selectors being a natural fit for page content. Because everything is asynchronous, I suppose there are also concurrency benefits: a slow-responding server in your list doesn't hold up the processing of the other sites you're scraping.
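
To illustrate the concurrency point, here is a minimal sketch in plain JavaScript. It does not use node.io; `fakeRequest` and `scrapeConcurrently` are hypothetical names, and the "servers" are simulated with timers rather than real HTTP requests. It only shows that results are handled as each one arrives, so the slow site finishes last without delaying the fast ones.

```javascript
// Simulated request (an assumption for illustration, not a real HTTP call):
// resolves with the site name after `delayMs` milliseconds.
function fakeRequest(site, delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve(site), delayMs));
}

// Start every request immediately and record each result as soon as it arrives.
async function scrapeConcurrently(sites) {
  const finished = [];
  await Promise.all(
    sites.map(({ name, delayMs }) =>
      fakeRequest(name, delayMs).then((result) => finished.push(result))
    )
  );
  return finished; // completion order, not input order
}

// Usage: the slow site is listed first but is processed last.
scrapeConcurrently([
  { name: 'slow-site', delayMs: 100 },
  { name: 'fast-site-a', delayMs: 10 },
  { name: 'fast-site-b', delayMs: 20 },
]).then((order) => console.log(order));
```

The total time is roughly that of the slowest request (about 100 ms here) rather than the sum of all three.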
1 comment

Node.js seemed like a perfect fit for a few reasons:

1. JS selectors make scraping _very_ easy.

2. Asynchronous I/O is fast as it is, but the page is also parsed as it's received. Contrast this with other scraping solutions where you need to download the whole page and can only parse it once the download is complete.

3. With asynchronous scraping it's trivial to handle failures, timeouts, retries, nested requests, recursing similar URLs, concurrent requests, etc. - just add one of the many options (https://github.com/chriso/node.io/wiki/API---Job-Options)
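
For anyone curious what timeouts and retries look like without a framework, here is a rough sketch in plain JavaScript. This is not node.io's API (its actual options are documented at the link above); `withTimeout`, `scrapeWithRetry`, and the `timeout` / `retries` option names are assumptions made up for this example, and `fetchPage` stands in for whatever function actually fetches a URL.

```javascript
// Hypothetical helper: reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout after ' + ms + 'ms')), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical helper: call `fetchPage(url)` up to `retries + 1` times,
// applying the timeout to each attempt separately.
async function scrapeWithRetry(fetchPage, url, { timeout = 1000, retries = 2 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await withTimeout(fetchPage(url), timeout);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // every attempt failed or timed out
}
```

A framework that exposes these as job options is essentially wrapping this kind of logic around each request, which is why it can feel like "just add an option".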