| HN Mirror



	by dabeeeenster 4463 days ago
	The problem with these sorts of solutions is that they work perfectly for 'simple' sites like The Register, but fail hard with 'modern' sites such as ASOS.com. Just tried ASOS and the web front end failed to request a product page correctly... All the dynamic JS and whatnot just plays havoc with these projects. In my experience you have to run through webdriver or something like phantomjs and parse the JS...

7 comments

alttab 4463 days ago

There are multiple internal tools I use at work (JIRA, our ticketing system, our code review tool) that won't work because of this issue.

In the meantime, I've written Tampermonkey scripts that will scrape and embed multiple pages all hack-like, but at least I get a good CSV of the data I need.

To me, the value in this tool is the user interface for creating the scrape logic. If this ran as an embeddable JS app that you could place inside any page and that used local storage, you could scrape these dynamic sites by viewing the page first, and still get all of the cool gadgetry provided by this tool.

In essence, the value of this tool could be built as a bookmarklet. THAT SIR - I would use every, single, day.

shaneofalltrad 4462 days ago

Great idea on the bookmarklet. I could see a tool for building custom readers with clippings from various sites. Say I want to organize JavaScript array patterns and ideas. Throw in a way to clip parts of my PDF books into this "reader" and you have an amazing product worth millions.

notastartup 4462 days ago

Can you explain this more, how do you see this being operated? See a pdf, clip it, create your own reader with your own clips?

stedaniels 4463 days ago

Why would you scrape JIRA when they have a perfectly workable API?

lepht 4463 days ago

While their APIs are nice, they require separate permissions and having access to them isn't always a given depending on the company that runs/owns the Jira instance.

alttab 4463 days ago

This. I also would rather just grab it in the browser instead of having to run a server or something else.

notastartup 4462 days ago

how would a bookmarklet be able to crawl & scrape a website?

umurkontaci 4462 days ago

You can have javascript code as a bookmarklet

notastartup 4462 days ago

yes I know that but how would you make it crawl links in your browser?

CHsurfer 4463 days ago

At first, this seems correct. It's definitely easier to get started scraping with something like Capybara and a suitable JS-enabled driver, but in my experience, this solution is less reliable. Async loaded data can time out, and don't get me started on the difficulties of running the scraper with cron jobs. In the end, I migrated even my JS-heavy pages to Mechanize-based solutions. It takes a few extra requests to get the async data, but once you get that figured out, it's rock solid - till they update the site design ;-)

yaph 4463 days ago

Use the tools suitable for the task. There are a lot of those "simple" sites and I'm pretty sure a lot of people will stick to those "dated" methods of building sites, because search traffic still matters.

johndavi 4463 days ago

The long tail is tough, but rules are useful when you only need to work with a small number of sites. And assuming, as you point out, less "modern" sites. (News sites tend to be mostly consistently manageable but, yes, smaller e-commerce players tend to adopt more modern techniques -- as befitting fashion-forward product lines, naturally).

Our (Diffbot) approach is to learn what news and product (and other) pages look like, and obviate the rules-management -- we also fully execute JS when rendering.

The web keeps evolving though, dang it. Tricky thing!

lsh 4463 days ago

Unfortunately Diffbot is not open source. Are you planning any F/OSS offerings?

CMCDragonkai 4461 days ago

I built SnapSearch for JS/SPA sites that need SEO. But it works for scraping as well. https://snapsearch.io/ You can try the demo. I tried it with "http://www.asos.com/" and it worked properly. Note that empty content actually means that the webserver returned with no body content. The real API will return the headers as well as the body.

It works via Firefox, and it's load balanced and multithreaded. It takes care of all the thorny issues regarding async content... etc.

agumonkey 4463 days ago

It also depends on a coherent structure in HTML websites.

Domains running websites which are more like JavaScript frontend modules shouldn't be scraped at all; it screams for a public API.

uptown 4463 days ago

"it screams for a public API"

But many content owners would never provide their data in this format even if doing so would be trivial.

CMCDragonkai 4461 days ago

Try using https://snapsearch.io/ It is designed for JS sites.

jdavis703 4463 days ago

These single page sites do have a public, albeit undocumented, API. If you analyze the network requests via the dev tools in your browser you'll have an XML/JSON data source that is probably structured better than the markup.
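
A minimal Python sketch of that approach, assuming the network tab showed a JSON endpoint. The payload shape (a "products" list with "name" and "price" keys), the UTF-8 encoding, and the function names are all hypothetical; a real site will use its own URL and field names.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10.0):
    """Fetch a JSON endpoint spotted in the browser's network tab."""
    # Some endpoints only answer XHR-style requests, so mimic those headers.
    request = Request(url, headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    })
    with urlopen(request, timeout=timeout) as response:
        # Assumes the body is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))


def extract_products(payload):
    """Flatten the (hypothetical) payload into (name, price) rows."""
    return [(item["name"], item["price"]) for item in payload.get("products", [])]


# Hypothetical response body, inlined so the example runs without a network.
sample = json.loads(
    '{"products": [{"name": "Shirt", "price": 20.0}, {"name": "Hat", "price": 12.5}]}'
)
rows = extract_products(sample)
print(rows)
```

Against a live site, the payload would come from fetch_json called with whatever URL the dev tools reveal, and extract_products would need adjusting to the structure that endpoint really returns.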

agumonkey 4463 days ago

Of course, I should have thought about it that way.

egb 4463 days ago

Anybody know of any tools that would work with JS-rendered sites, and not have to "parse the JS"?

egb 4463 days ago

Answering my own question:

CasperJS is an open source navigation scripting & testing utility written in Javascript for the PhantomJS WebKit headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks such as:

    defining & ordering browsing navigation steps
    filling & submitting forms
    clicking & following links
    capturing screenshots of a page (or part of it)
    testing remote DOM
    logging events
    downloading resources, including binary ones
    writing functional test suites, saving results as JUnit XML
    scraping Web contents

mopoke 4462 days ago

I wrote a blog post on my experiences using CasperJS to parse a single page site which used angular. http://www.andykelk.net/tech/web-scraping-with-casperjs

CMCDragonkai 4461 days ago

I recently created a service designed to make JS sites crawlable by search engines and other robots. However it works for scraping as well. Try the demo: https://snapsearch.io/

checker659 4463 days ago

PhantomJS?

e1g 4463 days ago

>Is it also a webscraper that can pull data out of a page for me?

No, Phantom will only recreate the page as it would look to a human user (i.e. after all JavaScript is parsed and executed). It will not help you parse or slice the page - you would have to do that part programmatically with other DOM-parsing tools.
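
As a rough illustration of that second step, here is a sketch that slices already-rendered HTML using only Python's standard library. The markup in rendered_html is made up, and the LinkCollector class is a hypothetical helper; in practice the string would be whatever the headless browser saved after running the page's scripts.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up stand-in for HTML captured after the browser executed the JS.
rendered_html = (
    '<ul><li><a href="/p/1">Red <b>shirt</b></a></li>'
    '<li><a href="/p/2">Hat</a></li></ul>'
)
collector = LinkCollector()
collector.feed(rendered_html)
collector.close()
print(collector.links)
```

The split mirrors the point above: the browser only produces the final markup, and a separate parser pulls the data out of it.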

checker659 4463 days ago

Sorry. I guess you're right. As a programmer, both of those look the same to me.

egb 4463 days ago

"PhantomJS is a headless WebKit scriptable with a JavaScript API", so it's a browser.

Is it also a webscraper that can pull data out of a page for me?

nols 4463 days ago

I've heard of people using PhantomJS with CasperJS to scrape, not sure if it can be done solely with PhantomJS.

vaviloff 4463 days ago

CasperJS is a higher-level wrapper for PhantomJS, so - yes, it could be done with PhantomJS solely... But you wouldn't want to, because CasperJS makes automation easier.

uptown 4463 days ago

Are there any libraries to facilitate database-connectivity to SQL Server or MySQL from javascript? I've used CasperJS to scrape some sites, but always fall back on post-processing the scraped data with another program in order to get it into my database. I'd love to be able to do it all from one piece of code.
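
Not an answer for JavaScript, but the post-processing step itself can be small. A sketch in Python, with the standard-library sqlite3 module standing in for SQL Server or MySQL; the products table, its columns, and the CSV contents are all hypothetical.

```python
import csv
import io
import sqlite3


def load_rows(conn, csv_text):
    """Insert scraped CSV rows into a (hypothetical) products table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (name TEXT PRIMARY KEY, price REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(record["name"], float(record["price"])) for record in reader]
    # The connection context manager commits on success, rolls back on error.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO products (name, price) VALUES (?, ?)", rows
        )
    return len(rows)


# Made-up scraper output; a real run would read the file the scraper wrote.
scraped_csv = "name,price\nShirt,20.0\nHat,12.5\n"
conn = sqlite3.connect(":memory:")
inserted = load_rows(conn, scraped_csv)
print(inserted, conn.execute("SELECT COUNT(*) FROM products").fetchone()[0])
```

Swapping in another database would mean a different driver and possibly a different upsert syntax, since INSERT OR REPLACE is specific to SQLite.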

lost_my_pwd 4463 days ago

SlimerJS too