| HN Mirror


	by sxp 2059 days ago
	+1 to using cheerio.js. When I need to write a web scraper, I've used Node's `request` library to get the HTML text and cheerio to extract links and resources for the next stage. I've also used cheerio when I want to save a functioning local cache of a webpage since I can have it transform all the various multi-server references for <img>, <a>, <script>, etc on the page to locally valid URLs and then fetch those URLs.
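
For illustration, the "transform references to locally valid URLs" step could look roughly like the sketch below. It leaves cheerio out and only shows the URL mapping, using Node's built-in `URL` and `path`. The `localPathFor` helper and the `cache/<hostname>/<path>` layout are assumptions made for this example, not anything from the comment above.

```javascript
const path = require('path');

// Hypothetical helper: resolve a reference found in a page (an href or src
// value) against the page URL, and pick a local cache path for it.
function localPathFor(ref, pageUrl, cacheDir = 'cache') {
  // Handles relative, root-relative and protocol-relative references.
  const abs = new URL(ref, pageUrl);
  // Directory-style URLs get an index.html so they can be stored as files.
  const pathname = abs.pathname.endsWith('/')
    ? abs.pathname + 'index.html'
    : abs.pathname;
  // Assumed layout: cache/<hostname>/<path>. Query strings are ignored here.
  return {
    remote: abs.href,
    local: path.posix.join(cacheDir, abs.hostname, pathname),
  };
}

const page = 'https://example.com/post/1';
const img = localPathFor('/img/a.png', page);
const script = localPathFor('//cdn.example.net/lib.js', page);
console.log(img.local, script.local);
```

Each attribute value pulled out by the selector library would go through a helper like this: `remote` is what gets fetched, `local` is what gets written back into the saved page.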

4 comments

domenicd 2059 days ago

The article didn't touch on this very well, but the reason to upgrade from cheerio to jsdom is if you want to run scripts. E.g., for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, and the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just "parse the bytes from the server into an initial DOM tree".

(I'm a maintainer of jsdom.)


megous 2059 days ago

Running the [arbitrary] scripts not written by me is what I usually try to avoid and fear when scraping.


rodw 2059 days ago

Self-plug warning but FWIW if you're using cheerio _just_ for the selector syntax a related tool is Stew [1] which is a dependency-free [2] node module that allows one to extract content from web pages (DOM trees) using CSS selectors, like:

var links = stew.select(dom,'a[href]');

extended with support for embedded regular expressions (for tags, classes, IDs, attributes or attribute values). E.g.:

var metadata = stew.select(dom,'head meta[name=/^dc\.|:/i]');

It's on npm as `stew-select`

[1] https://github.com/rodw/stew/

[2] there's an optional peer-dependency-ish relationship to htmlparser or htmlparser2 or similar to generate a DOM tree from raw HTML but anything that creates a basic DOM tree (`{type:, name:, children:[] }`) will suffice
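
To make the "basic DOM tree" shape in [2] concrete, here is a minimal sketch of a hand-built tree and a plain recursive walk that collects what a selector like `a[href]` would match. It does not use Stew. The `attribs` field and the `'tag'`/`'text'` type values are assumptions modeled on htmlparser-style output, and `selectLinks` is a hypothetical helper.

```javascript
// Hypothetical helper: depth-first walk collecting <a> elements with an href.
function selectLinks(nodes, found = []) {
  for (const node of nodes) {
    const isLink =
      node.type === 'tag' &&
      node.name === 'a' &&
      node.attribs !== undefined &&
      'href' in node.attribs;
    if (isLink) found.push(node);
    // Text nodes have no children, so guard before recursing.
    if (Array.isArray(node.children)) selectLinks(node.children, found);
  }
  return found;
}

// A tiny tree in the assumed { type, name, attribs, children } shape.
const dom = [
  {
    type: 'tag', name: 'div', attribs: {}, children: [
      { type: 'tag', name: 'a', attribs: { href: '/one' },
        children: [{ type: 'text', data: 'One' }] },
      { type: 'tag', name: 'a', attribs: { name: 'anchor' }, children: [] },
    ],
  },
];

const hrefs = selectLinks(dom).map((node) => node.attribs.href);
console.log(hrefs);
```

The second `<a>` has no `href`, so only the first one is collected.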


rajangdavis 2059 days ago

Another +1 for cheerio.

If I recall correctly, what was really helpful about it was that I could write whatever code I needed to query and parse the DOM in the browser console and then copy and paste it into a script with almost no changes.

It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.


tracker1 2059 days ago

I've been using cheerio with node-fetch in these scenarios, and puppeteer when I need a real browser.
