by dabeeeenster 4463 days ago
The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites such as ASOS.com. I just tried ASOS and the web front end failed to request a product page correctly...

All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...

7 comments

There are multiple internal tools I use at work (JIRA, our ticketing system, our code review tool) that won't work because of this issue.

In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages all hack-like, but at least I get a good CSV of the data I need.

To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and that used local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.

In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.

Great idea on the bookmarklet. I could see a tool for building custom readers with clippings from various sites. Say I want to organize JavaScript array patterns and ideas. Throw in a way to clip parts of my PDF books into this "reader" and you have an amazing product worth millions.
Can you explain this more, how do you see this being operated? See a pdf, clip it, create your own reader with your own clips?
Why would you scrape JIRA when they have a perfectly workable API?
While their APIs are nice, they require separate permissions and having access to them isn't always a given depending on the company that runs/owns the Jira instance.
This. I also would rather just grab it in the browser instead of having to run a server or something else.
how would a bookmarklet be able to crawl & scrape a website?
You can have javascript code as a bookmarklet
yes I know that but how would you make it crawl links in your browser?
At first, this seems correct. It's definitely easier to get scraping with something like Capybara and a suitable JS-enabled driver, but in my experience, this solution is less reliable. Async-loaded data can time out, and don't get me started on the difficulties of running the scraper with cron jobs. In the end, I migrated even my JS-heavy pages to Mechanize-based solutions. It takes a few extra requests to get the async data, but once you get that figured out, it's rock solid - till they update the site design ;-)
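
A rough sketch of the "few extra requests" idea, written in Python rather than with Mechanize. It assumes, purely for illustration, that the static HTML exposes its async endpoint in a `data-src` attribute and that the endpoint returns JSON; the attribute name, the `fetch` parameter, and the function names are hypothetical, and a real site needs its own discovery step.

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import urlopen


def http_get(url):
    """Default fetcher: a plain HTTP GET with no JavaScript execution."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def scrape_with_followup(page_url, fetch=http_get):
    """Fetch a page, then fetch the data the page would have loaded via JS."""
    html = fetch(page_url)
    # Assumption: the page advertises its data endpoint in a data-src attribute.
    match = re.search(r'data-src="([^"]+)"', html)
    if match is None:
        return None  # nothing to follow up on
    # Resolve relative endpoints against the page URL before requesting them.
    data_url = urljoin(page_url, match.group(1))
    return json.loads(fetch(data_url))
```

Passing `fetch` in as a parameter keeps the two-step logic testable without a network, which also helps when the scraper runs unattended from cron.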
Use the tools suitable for the task. There are a lot of those "simple" sites and I'm pretty sure a lot of people will stick to those "dated" methods of building sites, because search traffic still matters.
The long tail is tough, but rules are useful when you only need to work with a small number of sites. And assuming, as you point out, less "modern" sites. (News sites tend to be mostly consistently manageable but, yes, smaller e-commerce players tend to adopt more modern techniques -- as befitting fashion-forward product lines, naturally).

Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.

The web keeps evolving though, dang it. Tricky thing!

Unfortunately Diffbot is not open source. Are you planning any F/OSS offerings?
I built SnapSearch for JS/SPA sites that need SEO. But it works for scraping as well. https://snapsearch.io/ You can try the demo. I tried it with "http://www.asos.com/" and it worked properly. Note that empty content actually means that the webserver returned no body content. The real API will return the headers as well as the body.

It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.

It also depends on a coherent structure in HTML websites.

Domains running websites which are more like javascript frontend modules shouldn't be scraped at all, it screams for a public API.

"it screams for a public API"

But many content owners would never provide their data in this format even if doing so would be trivial.

Try using https://snapsearch.io/ It is designed for JS sites.
These single page sites do have a public, albeit, undocumented API. If you analyze the network requests via the dev tools in your browser you'll have an XML/JSON data source that is probably structured better than the markup.
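
As a small illustration of why that data source tends to be easier to work with than the markup, here is a sketch in Python. The payload is invented and only shaped like something you might see in the network tab; the field names (`products`, `name`, `price`, `value`) are assumptions, not any real site's API.

```python
import json

# Hypothetical response body copied from the browser's network tab.
captured = (
    '{"products": ['
    '{"id": 1, "name": "Shirt", "price": {"value": 20.0}}, '
    '{"id": 2, "name": "Jeans", "price": {"value": 45.0}}]}'
)


def extract_products(payload):
    """Return (name, price) pairs straight from the JSON, no HTML parsing."""
    data = json.loads(payload)
    return [(item["name"], item["price"]["value"]) for item in data["products"]]


print(extract_products(captured))  # [('Shirt', 20.0), ('Jeans', 45.0)]
```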
Of course, I should have thought about it that way.
Anybody know of any tools that would work with JS-rendered sites, and not have to "parse the JS"?
Answering my own question:

CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:

    defining & ordering browsing navigation steps
    filling & submitting forms
    clicking & following links
    capturing screenshots of a page (or part of it)
    testing remote DOM
    logging events
    downloading resources, including binary ones
    writing functional test suites, saving results as JUnit XML
    scraping Web contents
I wrote a blog post on my experiences using CasperJS to parse a single page site which used angular. http://www.andykelk.net/tech/web-scraping-with-casperjs
I recently created a service designed to make JS sites crawlable by search engines and other robots. However it works for scraping as well. Try the demo: https://snapsearch.io/
PhantomJS?
>Is it also a webscraper that can pull data out of a page for me?

No, Phantom will only recreate the page as it would look to a human user (i.e. after all JavaScript is parsed and executed). It will not help you parse or slice the page - you would have to do that part programmatically with other DOM-parsing tools.
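
To make the split concrete: once a headless browser has produced the rendered HTML, the slicing step could look something like this sketch, which uses Python's standard `html.parser`. The `TitleExtractor` class, the `title` class name, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every element whose class list includes a name.

    Assumes properly nested markup inside matching elements; an unclosed
    void tag such as <br> inside a match would throw the depth count off.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1
            self.titles.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.titles[-1] += data


# Hypothetical rendered output from the headless browser.
rendered = (
    '<ul><li class="product title">Red <b>shirt</b></li>'
    '<li class="other">x</li><li class="title">Jeans</li></ul>'
)
parser = TitleExtractor("title")
parser.feed(rendered)
print(parser.titles)  # ['Red shirt', 'Jeans']
```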

Sorry. I guess you're right. As a programmer, both of those look the same to me.
"PhantomJS is a headless WebKit scriptable with a JavaScript API", so it's a browser.

Is it also a webscraper that can pull data out of a page for me?

I've heard of people using PhantomJS with CasperJS to scrape, not sure if it can be done solely with PhantomJS.
CasperJS is a higher-level wrapper for PhantomJS, so - yes, it could be done with PhantomJS solely... But you wouldn't want to, because CasperJS makes automation easier.
Are there any libraries to facilitate database-connectivity to SQL Server or MySQL from javascript? I've used CasperJS to scrape some sites, but always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.
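
This doesn't answer the JavaScript question, but for reference, the post-processing step can stay fairly small. A sketch in Python, assuming the scraper writes one JSON object per line; SQLite is used only as a stand-in for SQL Server or MySQL, and the table, column names, and `load_scraped_rows` helper are hypothetical.

```python
import json
import sqlite3


def load_scraped_rows(json_lines, connection):
    """Insert one row per non-empty JSON line; return the number stored."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(url TEXT PRIMARY KEY, title TEXT, price REAL)"
    )
    rows = [json.loads(line) for line in json_lines if line.strip()]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR REPLACE INTO products (url, title, price) "
            "VALUES (:url, :title, :price)",
            rows,
        )
    return len(rows)
```

Using the URL as the primary key with `INSERT OR REPLACE` means re-running the scraper updates existing rows instead of duplicating them.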
SlimerJS too