by sxp 2059 days ago
+1 to using cheerio.js. When I need to write a web scraper, I use Node's `request` library to get the HTML text and cheerio to extract links and resources for the next stage.

I've also used cheerio when I want to save a functioning local cache of a webpage, since I can have it rewrite the various multi-server references in <img>, <a>, <script>, etc. on the page to locally valid URLs and then fetch those URLs.
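
A rough sketch of the URL-mapping half of that local-cache idea, for illustration only. It uses nothing but the built-in `URL` class; `toLocalPath`, the example URLs, and the `index.html` convention are all made up here, and reading or rewriting the attributes themselves (the cheerio part) is left out. Query strings and non-HTTP schemes such as `data:` are ignored.

```javascript
// Hypothetical helper: map a (possibly relative) reference found on a page to a
// local relative path, keeping resources from different hosts in separate folders.
function toLocalPath(ref, pageUrl) {
  const url = new URL(ref, pageUrl);            // resolve relative refs against the page URL
  let path = url.pathname;
  if (path.endsWith('/')) path += 'index.html'; // directory-style URLs need a file name
  return `${url.hostname}${path}`;
}

// Assumed example inputs: one page and three references found on it.
const page = 'https://example.com/blog/post.html';
const refs = ['/img/logo.png', 'https://cdn.example.net/lib/app.js', 'notes/'];

// Pairs of [absolute URL to fetch, local path to save to and write back into the attribute].
const plan = refs.map((ref) => [new URL(ref, page).href, toLocalPath(ref, page)]);
console.log(plan);
```

Prefixing the host name is one way to keep same-named files from different servers from colliding; a real cache would probably also want to handle query strings.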

4 comments

The article didn't touch on this very well, but the reason to upgrade from cheerio to jsdom is if you want to run scripts. E.g., for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, and the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just "parse the bytes from the server into an initial DOM tree".

(I'm a maintainer of jsdom.)

Running [arbitrary] scripts not written by me is exactly what I fear and try to avoid when scraping.

Self-plug warning, but FWIW: if you're using cheerio _just_ for the selector syntax, a related tool is Stew [1], a dependency-free [2] node module that lets you extract content from web pages (DOM trees) using CSS selectors, like:

var links = stew.select(dom,'a[href]');

extended with support for embedded regular expressions (for tags, classes, IDs, attributes or attribute values). E.g.:

var metadata = stew.select(dom,'head meta[name=/^dc\.|:/i]');

It's on npm as `stew-select`

[1] https://github.com/rodw/stew/

[2] there's an optional peer-dependency-ish relationship to htmlparser or htmlparser2 or similar to generate a DOM tree from raw HTML but anything that creates a basic DOM tree (`{type:, name:, children:[] }`) will suffice
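
To make the tree shape and the regex concrete, here is a dependency-free sketch. It does not use Stew: `findByAttr` is a hypothetical stand-in that ignores the `head` ancestor part of the selector, the `attribs` field is an assumption borrowed from htmlparser-style output, and the sample tree is made up. The reading of the pattern is plain JavaScript regex semantics; Stew's own matching rules may differ in the details.

```javascript
// Hand-built tree in the `{type, name, children}` shape. The `attribs` object is
// an assumption modeled on htmlparser-style output.
const dom = [{
  type: 'tag', name: 'head', attribs: {}, children: [
    { type: 'tag', name: 'meta', attribs: { name: 'DC.title', content: 'Example' }, children: [] },
    { type: 'tag', name: 'meta', attribs: { name: 'og:image', content: 'a.png' }, children: [] },
    { type: 'tag', name: 'meta', attribs: { name: 'viewport', content: 'width=device-width' }, children: [] },
  ],
}];

// Hypothetical stand-in for a selector engine: depth-first walk that collects
// every `tagName` node whose `attr` value matches `pattern`.
function findByAttr(nodes, tagName, attr, pattern) {
  const hits = [];
  for (const node of nodes) {
    const value = node.attribs && node.attribs[attr];
    if (node.name === tagName && typeof value === 'string' && pattern.test(value)) {
      hits.push(node);
    }
    hits.push(...findByAttr(node.children || [], tagName, attr, pattern));
  }
  return hits;
}

// /^dc\.|:/i reads as "starts with dc." OR "contains a colon", case-insensitively.
const metadata = findByAttr(dom, 'meta', 'name', /^dc\.|:/i);
console.log(metadata.map((m) => m.attribs.name)); // DC.title and og:image, but not viewport
```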

Another +1 for cheerio.io

If I recall correctly, what was really helpful about it was that I could write whatever code I needed to query and parse the DOM in the browser console and then copy and paste it into a script with almost no changes.

It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.

Been using cheerio with node-fetch myself in these scenarios, puppeteer as well when I need a real browser.