by mxavier
5538 days ago
The article fails to mention this, but there are probably more reasons why this might be a good idea besides the fact that using JS selectors on page content is a natural fit. Because everything is asynchronous, I suppose there are also some concurrency benefits: a slow-responding server in your list doesn't slow down the processing of the other sites you're scraping.
|
1. JS selectors make scraping _very_ easy.
2. Asynchronous I/O is fast as it is, but the page is also parsed as it's received - contrast this with other scraping solutions, where you need to download the whole page and can only parse it once the download is complete.
3. With asynchronous scraping it's trivial to handle failures, timeouts, retries, nested requests, recursing into similar URLs, concurrent requests, etc. - just add one of the many options (https://github.com/chriso/node.io/wiki/API---Job-Options).
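
To make the concurrency and timeout/retry points concrete, here is a rough sketch. It is not node.io's API; it uses Python's asyncio as a stand-in event loop, and `fake_fetch`, `fetch_with_retry`, `scrape_all`, the `*.example` site names, and the timeout and retry values are all made up for illustration. The "requests" are simulated with sleeps, so nothing touches the network.

```python
import asyncio

# Hypothetical stand-in for an HTTP request: each "site" sleeps for its
# configured delay and fails a fixed number of times before succeeding.
async def fake_fetch(site, delay, failures, attempts):
    attempts[site] = attempts.get(site, 0) + 1
    await asyncio.sleep(delay)
    if attempts[site] <= failures:
        raise ConnectionError(f"{site} failed")
    return f"<html>{site}</html>"

# Retry a request up to `retries` extra times, giving up on each attempt
# after `timeout` seconds. Returns None if every attempt fails.
async def fetch_with_retry(site, delay, failures, attempts, timeout=0.2, retries=2):
    for _ in range(retries + 1):
        try:
            return await asyncio.wait_for(
                fake_fetch(site, delay, failures, attempts), timeout
            )
        except (asyncio.TimeoutError, ConnectionError):
            continue
    return None

# Start all requests at once; a slow site only delays its own result.
async def scrape_all(sites):
    attempts = {}
    tasks = [
        fetch_with_retry(name, delay, failures, attempts)
        for name, delay, failures in sites
    ]
    results = await asyncio.gather(*tasks)
    return dict(zip((name for name, _, _ in sites), results)), attempts

# (name, simulated response delay in seconds, failures before success)
SITES = [
    ("fast.example", 0.01, 0),
    ("flaky.example", 0.01, 1),
    ("slow.example", 5.0, 0),
]

def run():
    return asyncio.run(scrape_all(SITES))
```

Calling `run()` returns the fetched pages plus a per-site attempt count. The fast and flaky sites finish on their own schedule while the slow one is still timing out, and the slow one ends up as `None` after its retries are used up rather than holding everything else back.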