Hacker News
by adriancooney 3710 days ago
This is a fantastic idea, and I'm really surprised nothing like this has existed before; it seems like such a no-brainer. Great work.
8 comments

https://github.com/fizx/parsley/wiki looks pretty similar.

Running this sort of thing as a service/api never panned out for us because you are almost universally robots.txt denied and/or blocked.
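
To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-scraper-service` user agent, and the example.com URL are all made up for illustration; a real service would fetch each site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allows one named crawler, denies everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A generic scraping service falls under the catch-all "*" rule.
    print(is_allowed("example-scraper-service", "https://example.com/products"))
    # The explicitly named crawler has an empty Disallow, i.e. no restrictions.
    print(is_allowed("Googlebot", "https://example.com/products"))
```

A policy like this one lets a named search crawler in and turns everything else away, which is the situation a generic extraction service would run into.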

We briefly tried, and supported a wiki of json extraction scripts at parselets.org, but it went nowhere after a few months.

I built something almost identical in 2011. It really doesn't have as much utility in practice as you'd initially think. CSS selectors are an interesting idea for extracting data from pages, but they're extremely fragile. You either have to parse the page's raw HTML using something like jsdom, or run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (Angular, React, etc.). In the second case, Phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA as a regular browser does.
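
A rough sketch of the raw-HTML case, using only Python's standard `html.parser`. The `ClassTextExtractor` class and `select_text` helper are hypothetical stand-ins for a real CSS selector engine (roughly `li.title`), and both page snippets are invented. The point is only that the same selector that works on server-rendered markup finds nothing in an SPA shell, because the content is not there until the JavaScript runs.

```python
from html.parser import HTMLParser
from typing import List, Optional


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <tag> element carrying the class `cls`.

    A rough stand-in for the CSS selector `tag.cls`; it does not handle
    nested elements of the same tag.
    """

    def __init__(self, tag: str, cls: str):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.results: List[str] = []
        # Text chunks of the element currently being captured, if any.
        self._buffer: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def select_text(html: str, tag: str, cls: str) -> List[str]:
    """Return the text of all elements matching `tag.cls` in the raw HTML."""
    parser = ClassTextExtractor(tag, cls)
    parser.feed(html)
    parser.close()
    return parser.results


# Server-rendered page: the data is in the markup.
STATIC_PAGE = '<ul><li class="title">First post</li><li class="title">Second post</li></ul>'
# SPA shell: only a mount point and a script tag; content appears after JS runs.
SPA_SHELL = '<div id="root"></div><script src="/bundle.js"></script>'

if __name__ == "__main__":
    print(select_text(STATIC_PAGE, "li", "title"))
    print(select_text(SPA_SHELL, "li", "title"))
```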

You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.
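
For illustration, a test of that kind might look like the sketch below. It uses the standard library's `xml.etree.ElementTree` and its limited XPath support in place of a CSS selector engine, so it assumes well-formed markup; real-world HTML would need a proper HTML parser. The `PROFILE` fields, the class names, and the page snippets are all invented.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical selector profile: field name -> ElementTree XPath expression.
PROFILE = {
    "title": ".//h2[@class='product-title']",
    "price": ".//span[@class='price']",
}

BEFORE_REFACTOR = """<div>
  <h2 class="product-title">Widget</h2>
  <span class="price">9.99</span>
</div>"""

# The same page after a developer renames a single class.
AFTER_REFACTOR = BEFORE_REFACTOR.replace('class="price"', 'class="product-price"')


def broken_selectors(markup: str, profile: Dict[str, str]) -> List[str]:
    """Return the profile fields whose selector matched no non-empty text."""
    root = ET.fromstring(markup)
    broken = []
    for field, path in profile.items():
        texts = [(el.text or "").strip() for el in root.findall(path)]
        if not any(texts):
            broken.append(field)
    return broken


if __name__ == "__main__":
    print(broken_selectors(BEFORE_REFACTOR, PROFILE))
    print(broken_selectors(AFTER_REFACTOR, PROFILE))
```

The check catches the breakage, but only after the fact; someone still has to go and update the profile every time it fires.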

Just wasn't worth the hassle.

There are some solutions for running an SPA in a real browser, even in a headless environment.

The trick is to emulate X11 with Xvfb and control the browser with Selenium WebDriver.

Phantom isn't the only choice, just the one most people talk about.

As for non-JS-heavy websites, it's fairly trivial to find a library that will parse the DOM for you; pretty much every language has one.

Done it also, to scrape HN from the CLI :p
Done it as well. At the time it was specialized to organize web comics (this was way before Google Reader times).

The real issue is that popularity will get you blocked fairly quickly. See also: Yahoo Pipes.

There was a YC company a few years back that did something very similar; it got acquired by Palantir in February. https://www.kimonolabs.com/
Well, https://www.apifier.com does essentially the same thing, plus it supports JavaScript, can crawl through the whole website etc.

Disclaimer: I'm a cofounder there

So does https://PhantomJsCloud.com, but for single pages only, no site crawling.

Disclaimer: I'm the founder there ;)

Apifier actually looks awesome.
Using a 3rd party site to query HTML (which you should be able to do yourself, plenty of tools for that) isn't a fantastic idea.

This one, for example http://www.videlibri.de/xidel.html#examples

The code is on GitHub; you can use it as a library, not as a 3rd-party SaaS.
Kimono Labs used to do something similar, but shut down recently. They had a nice clicky-pointy interface that let you build the selectors by clicking on elements, with an immediate preview. They also handled things like pagination.
> I'm really surprised nothing like this has existed before

But how would you monetize it?

Unlike an RSS feed, you really don't know how the JSON response would be used, so you can't inject ads into it.

And if you charge for it, wouldn't people assume it will continue to work? Site "scrapers", regardless of how they are configured, are likely to break. It would be tough to have customers pay for something that could break at any time, leaving them to figure out whether it's the service that has changed/broken or the page that has changed.

Don't get me wrong: some great businesses have been/are based on "scraping" in one way or another. However, as cool as this is, it's just another way to "scrape". If the person hosting the page would provide an API or JSON view instead, you'd be loads better off.

Freemium, professional support, expanding it into an abstraction layer above the APIs for multiple services, selling a version that larger companies can run on their own servers which they might need for data security...

In any case, not everything has to be monetised.

> However, as cool as this is, it's just another way to "scrape"

Isn't that the point? The demo seems like it'd be a lot easier, less verbose, and probably less brittle than using cURL/XPath or otherwise parsing that HTML yourself.

We launched WrapAPI (https://wrapapi.com/) a few weeks ago with the same functionality, but a somewhat more complex and powerful setup process. You can not only specify CSS selectors yourself but also define them by pointing and clicking.

The barrier for starting with JamAPI is impressively low, though! Kudos on the developer-friendly user interface.