by egb 4466 days ago
Anybody know of any tools that would work with JS-rendered sites, and not have to "parse the JS"?
3 comments

Answering my own question:

CasperJS is an open source navigation scripting & testing utility written in JavaScript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:

    defining & ordering browsing navigation steps
    filling & submitting forms
    clicking & following links
    capturing screenshots of a page (or part of it)
    testing remote DOM
    logging events
    downloading resources, including binary ones
    writing functional test suites, saving results as JUnit XML
    scraping Web contents
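
A few of the tasks listed above (ordering navigation steps, capturing a screenshot, scraping content) might look roughly like this in a CasperJS script. This is only a minimal sketch, not something from the CasperJS docs verbatim: the URL, the `h2` selector, the screenshot filename and the `cleanTexts` helper are all placeholders I made up. The `typeof phantom` guard is there only so the helper can be loaded outside CasperJS.

```javascript
// Hypothetical helper: collapse whitespace and drop empty strings.
function cleanTexts(texts) {
    return texts
        .map(function (t) { return String(t).replace(/\s+/g, ' ').trim(); })
        .filter(function (t) { return t.length > 0; });
}

// `phantom` is a global only inside PhantomJS/CasperJS, so the
// navigation below is skipped when the file is loaded elsewhere.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();

    // Step 1: open the page (placeholder URL).
    casper.start('http://example.com/');

    // Step 2: wait until the JS-rendered content is in the DOM.
    casper.waitForSelector('h2', function () {
        this.capture('page.png'); // screenshot of the rendered page

        // evaluate() runs inside the page, so it sees the live DOM.
        var raw = this.evaluate(function () {
            return Array.prototype.map.call(
                document.querySelectorAll('h2'),
                function (el) { return el.textContent; }
            );
        }) || [];
        this.echo(JSON.stringify(cleanTexts(raw)));
    });

    casper.run();
}
```

Saved as, say, `scrape.js`, it would be run with `casperjs scrape.js`.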
I wrote a blog post about my experience using CasperJS to scrape a single-page site built with Angular: http://www.andykelk.net/tech/web-scraping-with-casperjs
I recently created a service designed to make JS sites crawlable by search engines and other robots. However, it works for scraping as well. Try the demo: https://snapsearch.io/
PhantomJS?
>Is it also a webscraper that can pull data out of a page for me?

No, Phantom will only recreate the page as it would look to a human user (i.e. after all JavaScript is parsed and executed). It will not help you parse or slice the page; you would have to do that part programmatically with other DOM-parsing tools.

Sorry. I guess you're right. As a programmer, both of those look the same to me.
"PhantomJS is a headless WebKit scriptable with a JavaScript API", so it's a browser.

Is it also a webscraper that can pull data out of a page for me?

I've heard of people using PhantomJS with CasperJS to scrape, but I'm not sure whether it can be done with PhantomJS alone.
CasperJS is a higher-level wrapper for PhantomJS, so yes, it could be done with PhantomJS alone. You wouldn't want to, though, because CasperJS makes automation easier.
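
To illustrate the point that scraping can be done with PhantomJS on its own, here is a rough sketch using only the `webpage` module. The URL and the `h1` selector are placeholders, and `scrapeTexts` is a name I made up. Treat it as an outline under those assumptions, not a finished scraper.

```javascript
// Runs inside the page, so it must be self-contained: it cannot see
// variables from the outer script, only the page's own `document`.
function scrapeTexts(selector) {
    var nodes = document.querySelectorAll(selector);
    return Array.prototype.map.call(nodes, function (el) {
        return el.textContent.trim();
    });
}

// Only run the browser part under PhantomJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) { // placeholder URL
        if (status !== 'success') {
            console.log('Failed to load page');
            phantom.exit(1);
            return;
        }
        // Extra arguments to evaluate() are passed on to the function.
        var titles = page.evaluate(scrapeTexts, 'h1');
        console.log(JSON.stringify(titles));
        phantom.exit();
    });
}
```

Note that the `page.open` callback fires when the page has loaded, which may be before a JS-heavy site has finished rendering. With PhantomJS alone you would have to write your own polling loop for that, which is the kind of boilerplate CasperJS's `waitFor*` helpers wrap.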
Are there any libraries to facilitate database connectivity to SQL Server or MySQL from JavaScript? I've used CasperJS to scrape some sites, but I always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.
You could always have your CasperJS scraping script make an AJAX request to a RESTful API for your MySQL DB. You won't be doing it all from one piece of code, but you'll be doing about 90% of it.
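
A sketch of what that could look like, with loud assumptions: the API endpoint (`http://localhost:3000/api/products`), the JSON payload shape, the scraped URL and the `.product` selector are all hypothetical, and the REST API itself (the part that talks to MySQL) is not shown. Instead of an in-page AJAX call, this variant sends the POST with `casper.open`, which sidesteps cross-origin restrictions in the scraped page.

```javascript
// Build the settings object for casper.open(). The JSON shape is an
// assumption about what a hypothetical REST API would accept.
function buildPostSettings(rows) {
    return {
        method: 'post',
        headers: { 'Content-Type': 'application/json; charset=utf-8' },
        data: JSON.stringify({ rows: rows })
    };
}

// Only run the scraping part under CasperJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();
    var rows = [];

    // Step 1: scrape the page (placeholder URL and selector).
    casper.start('http://example.com/products', function () {
        rows = this.evaluate(function () {
            return Array.prototype.map.call(
                document.querySelectorAll('.product'),
                function (el) { return { name: el.textContent.trim() }; }
            );
        }) || [];
    });

    // Step 2: POST the scraped rows to the (hypothetical) API, which
    // is responsible for the actual INSERT into the database.
    casper.then(function () {
        this.open('http://localhost:3000/api/products', buildPostSettings(rows));
    });

    casper.then(function () {
        this.echo('Sent ' + rows.length + ' rows');
    });

    casper.run();
}
```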
SlimerJS too