Hacker News new | ask | show | jobs
by TheAceOfHearts 3035 days ago
Some people are mentioning headless Chromium, so I wanna mention another tool I've used to replace some of phantomjs' functionality: jsdom [0].

It's much more lightweight than a real browser, and it doesn't require large extra binaries.

I don't do any complex scrapping, but occasionally I want to pull down and aggregate a site's data. For most pages, it's as simple as making a request and passing the response into a new jsdom instance. You can then query the DOM using the same built-in browser APIs you're already familiar with.

I've previously used jsdom to run a large web app's tests on node, which provided a huge performance boost and drastically lowered our build times. As long as you maintain a good architecture (i.e. isolating browser specific bits from your business logic) you're unlikely to encounter any pitfalls. Our testing strategy was to use node and jsdom during local testing and on each commit. IMO, you should generally only need to run tests on an actual browser before each release (as a safety net), and possibly on a regular schedule (if your release cycle is long).

[0] https://www.npmjs.com/package/jsdom

5 comments

Cheerio [0] is fantastic for this as well...

[0]: https://www.npmjs.com/package/cheerio

I've tried Cheerio as well, but I prefer JSDOM since it exposes the DOM APIs. What I'll normally do is interactively test things out in the browser's console, and then transfer em over to my script. Browser dev tools are just super amazing.
Agreed - I find the Cheerio APIs to be awkward when traversing deep into the DOM. Last time I used Beautiful Soup I found it also had this problem. The DOM API that JSDOM provides is such much more natural to work with.
Cheerio is so much faster and more lightweight than jsDOM. If you can do it with Cheerio, you absolutely should.
Using Cheerio with TypeScript types to parse data from tables in downloaded HTML files. Great tool.
One question I've had recently is how to scrape out a Javascript object out of HTML source. With server-side react + redux, I've wanted to be able to scrap out the serialised var __STATE__ = {...} object to JSON, from nodejs. Best solution I cobbled together was to basically eval() the JS source, which I know is far from ideal.
You could use a parser like esprima or its equivalent from the babeljs ecosystem on the JS source instead and just find the global variable with name `__STATE__` and just eval its init expression. Cheaper, more secure, more direct than actually running the JS.
I actually looked into this (from reading docs, never wrote code) and I wasn't able to find a way to convert the AST for the ObjectExpression into JSON or an actual Javascript object.
What you need is a code generation library that will turn the AST back into JS code once you've identified which part of the syntax tree you're interested in. And that's the code you want to then eval(). Esprima has escodegen for that purpose. I'm not sure what the counterparts are in the babel world. Feel free to shoot me an email with any specifics of where you're getting stuck thinking this through (email should be visible from my profile?), and I'll be glad to help.
You can use the vm module [0] to securely execute the code.

[0] https://nodejs.org/api/vm.html

Right before release is a bad time to realize there are problems with your build
Better than right after release.
Depends on how often you release and who sets the schedule.
jsdom is pretty unforgiving and won't load broken HTML. That's where I stopped using it. Could use some tidy tool maybe.
jsdom is an impressive achievement, but it may not be what you want depending on what you’re trying to do. It doesn’t mimic the behavior of browsers well in a number of regards, so it will let you do things that real browsers don’t allow. If you’re doing integration-type testing that can lead to tests that pass but functionality that fails in real browsers.