| HN Mirror



	by adriancooney 3710 days ago
	This is a fantastic idea and I'm really surprised nothing like this has existed before; it seems like such a no-brainer. Great work.

8 comments

fizx 3710 days ago

https://github.com/fizx/parsley/wiki looks pretty similar.

Running this sort of thing as a service/API never panned out for us because you are almost universally denied by robots.txt and/or blocked.

We briefly tried, and supported a wiki of JSON extraction scripts at parselets.org, but it went nowhere after a few months.
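
A minimal sketch of the robots.txt problem, using Python's standard `urllib.robotparser`. The robots.txt body, the `my-scraper-bot` user agent, and the example.com URL are made up for illustration; real sites differ, and being blocked by IP is a separate issue this does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("my-scraper-bot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

With a policy like this one, a generic scraping service is turned away while a named search crawler is let through.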


geuis 3710 days ago

I built something almost identical in 2011. It really doesn't have as much utility in practice as you initially think. CSS selectors are an interesting idea for extracting data from pages, but they're extremely fragile. You either have to parse the page's raw HTML using something like jsdom, or run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (Angular, React, etc). In the second case, Phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA the way a regular browser does.

You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.

Just wasn't worth the hassle.
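
To illustrate both failure modes, here is a rough sketch of a selector health check using only Python's standard `html.parser`. It supports just a `tag.class` match rather than real CSS selectors, and the three HTML snippets and the `select` helper are invented for this example, not taken from any of the tools discussed.

```python
from html.parser import HTMLParser


class ClassMatcher(HTMLParser):
    """Collect the text of every element matching a `tag.class` pair."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.matches = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a match: track nested elements of the same tag.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.matches.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def select(html, tag, css_class):
    """Return the text of all `tag.css_class` elements in raw HTML."""
    parser = ClassMatcher(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical pages: server-rendered, an SPA shell, and a refactored markup.
STATIC_PAGE = '<ul><li class="title">First post</li><li class="title">Second post</li></ul>'
SPA_SHELL = '<div id="root"></div><script src="/bundle.js"></script>'
REFACTORED = '<ul><li class="post-title">First post</li></ul>'

if __name__ == "__main__":
    for name, page in [("static", STATIC_PAGE), ("spa", SPA_SHELL), ("refactored", REFACTORED)]:
        print(name, select(page, "li", "title"))
```

The same `li.title` profile returns data for the server-rendered page, nothing for the SPA shell (the content only exists after the JavaScript runs), and nothing once the class is renamed.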


mickael-kerjean 3709 days ago

There are some solutions for running an SPA in a real browser, even in a headless environment.

The trick is to emulate X11 with Xvfb and control the browser with Selenium WebDriver.

Phantom isn't the only choice, just the one most people talk about.

As for websites that aren't JS-heavy, it's fairly trivial to find a library that will parse the DOM for you; pretty much every language has one.
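
A rough sketch of the Xvfb plus Selenium approach, assuming `Xvfb`, Firefox with its driver, and the `selenium` Python package are installed. The display number `:99`, the screen size, and the helper names are arbitrary choices for illustration, and the fixed sleep is a crude stand-in for properly waiting on the display and on the page.

```python
import os
import subprocess
import time


def xvfb_command(display, width=1280, height=1024):
    """Build the argv for an Xvfb virtual display (does not start it)."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x24"]


def render_with_real_browser(url, display=99):
    """Return the page source of `url` as seen by Firefox running under Xvfb."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    xvfb = subprocess.Popen(xvfb_command(display))
    os.environ["DISPLAY"] = f":{display}"  # point the browser at the virtual display
    time.sleep(1)  # crude wait for Xvfb to come up; a real script should poll
    try:
        driver = webdriver.Firefox()
        try:
            driver.get(url)
            # An SPA may still be rendering here; an explicit wait may be needed.
            return driver.page_source
        finally:
            driver.quit()
    finally:
        xvfb.terminate()
        xvfb.wait()


if __name__ == "__main__":
    print(" ".join(xvfb_command(99)))
```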


NicoJuicy 3709 days ago

Done it too, to scrape HN from the CLI :p


LoSboccacc 3710 days ago

Done it as well. At the time it was specialized to organize web comics (this was way before the Google Reader days).

The real issue is that popularity will get you blocked fairly quickly. See also: Yahoo Pipes.


moeamaya 3710 days ago

A YC company from a few years back did something very similar; it got acquired by Palantir in February. https://www.kimonolabs.com/


jancurn 3710 days ago

Well, https://www.apifier.com does essentially the same thing, plus it supports JavaScript, can crawl through the whole website etc.

Disclaimer: I'm a cofounder there


novaleaf 3709 days ago

So does https://PhantomJsCloud.com, but only for single pages; no site crawling.

Disclaimer: I'm the founder there ;)


fiatjaf 3710 days ago

Apifier actually looks awesome.


xaduha 3710 days ago

Using a 3rd party site to query HTML (which you should be able to do yourself, plenty of tools for that) isn't a fantastic idea.

This one, for example: http://www.videlibri.de/xidel.html#examples


martinvol 3710 days ago

The code is on GitHub; you can use this as a library rather than as a third-party SaaS.


lorenzhs 3710 days ago

Kimono Labs used to do something similar, but shut down recently. They had a nice point-and-click interface that let you build the selectors by clicking on elements, with an immediate preview. They also handled things like pagination.


mappy 3710 days ago

> I'm really surprised nothing like this has existed before

But how would you monetize it?

Unlike an RSS feed, you really don't know how the JSON response would be used, so you can't inject ads into it.

And if you charge for it, wouldn't people assume it would continue to work? Site "scrapers", regardless of how they are configured, are likely to break, so it would be tougher to have customers pay for something that could break at any time, leaving them to figure out whether it's the service that has changed/broken or the page that has changed.

Don't get me wrong: some great businesses have been/are based on "scraping" in one way or another. However, as cool as this is, it's just another way to "scrape". If the person hosting the page would provide an API or JSON view instead, you'd be loads better off.


nsgi 3710 days ago

Freemium, professional support, expanding it into an abstraction layer above the APIs for multiple services, selling a version that larger companies can run on their own servers which they might need for data security...

In any case, not everything has to be monetised.


g00gler 3709 days ago

> However, as cool as this is, it's just another way to "scrape"

Isn't that the point? The demo seems like it'd be a lot easier, less verbose, and probably less brittle than using cURL/XPath or otherwise parsing that HTML yourself.


phsource 3709 days ago

We launched WrapAPI (https://wrapapi.com/) a few weeks ago with the same functionality, but with a somewhat more complex and more powerful setup process. You can not only specify CSS selectors yourself but also define them by point and click.

The barrier for starting with JamAPI is impressively low, though! Kudos on the developer-friendly user interface.
