A little word of warning/encouragement. I did something similar a long time ago (JSONDuit), which got posted to HN by someone else.
You will probably run into a healthy mix of "that's cool" / "I did that before you!" / "but how will it make money?". Ignore it and do your thing. If you figure out how to monetize it, great! Even if you don't or if you have no desire to, you will have learned and grown during the course of the project. That is invaluable.
I find this "shoot first (write code), ask questions later" attitude both admirable and a bit worrisome. Nothing against people learning stuff, but why does it have to be promoted this way? Lack of humility is what gets me.
Maybe I'm just jealous or something, but it rubs me the wrong way.
I think it's less about promotion and more about feedback. "Here's what I've built, what do you think?"
We're meant to be an inclusive community of smart people. The idea is we'll encourage the poster and offer constructive criticism (or praise).
If the post is useful to no one, it simply won't get discussed or upvoted. When something does, it's validated as an idea, or as something of interest.
I was trying to implement a Firefox add-on that navigates the web by speech for my senior project (the theme was assistive technology). The major blocker in developing any tool to help navigate web pages, despite some effort in ARIA, is that there are so many actions one can neither parse nor perform without writing custom code for every single website. For example, how can you tell where the login button is, or what to click? Frankly, if you look at Gmail, the DOM was a huge compressed mess (the names were all rewritten; without gmail.js I wouldn't have been able to get Gmail working in my project). If every website exposed a standard set of APIs, that could lower the barrier by a good percentage. So think of combining HATEOAS from REST with this. Here we turn things into JSON with href, allowing a client to navigate; it's sort of a first step toward making a website more "web client" compatible... funny, isn't it?
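To make the JSON-with-href idea a bit more concrete, here is a rough sketch. The JSON shape is made up for illustration: the `actions`, `rel` and `href` field names and the example.com URLs are assumptions, not something any real site exposes.

```javascript
// Hypothetical JSON a site might expose; "actions", "rel" and "href" are assumed names.
const page = {
  title: "Inbox",
  actions: [
    { rel: "login", href: "https://example.com/login" },
    { rel: "compose", href: "https://example.com/compose" }
  ]
};

// Find where an action lives without knowing anything about the page's DOM.
function findAction(doc, rel) {
  const match = (doc.actions || []).find(function (a) { return a.rel === rel; });
  return match ? match.href : null;
}

console.log(findAction(page, "login"));
```

A speech-driven client could then map "log in" to `findAction(page, "login")` instead of guessing which element is the login button.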
It's not a totally original concept. Screen-scraping has been around for a while - essentially what this solves. I did mine for Ajax:
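this.get = function(html, selector){
  // Parse the HTML string into a document and select the matching nodes.
  var es = new DOMParser().parseFromString(html, 'text/html').querySelectorAll(selector);
  // Convert the NodeList to an array and return each node's text.
  return [].slice.call(es).map(function(n){ return n.innerText });
};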
It's not a product per se, but the combination of data and view is one of the unfortunate aspects of the web that (sorry) won't get fixed. Not everyone will build JSON APIs. And, hate away, but HTML & JS are here for a long time to come. The need is very real, and this would be a critical part of a scraping or IFTTT-like service (plumbing), if not a product you sell outright to end users.
- mentions the term that this concept falls under (it's nowhere on the OP's page, so he may not know that there is an entire set of software, plugins, etc. that does this)
- provides one alternative implementation
- adds commentary related to why such services are necessary, and that they should be able to be monetized
So yes, he starts off with something along the lines of "I did that before you!", but he doesn't use a condescending phrase, and he provides additional useful information.
I think wanting to know how it will be monetized or support itself is a reasonable question. If there isn't an answer, you know not to build something on it and expect it to last.
I built something almost identical in 2011. It really doesn't have as much utility in practice as you might initially think. CSS selectors are an interesting idea for extracting data from pages, but they're extremely fragile. You have to either parse the page's raw HTML using something like jsdom, or run it through a headless browser like Phantom. In the first case, it completely fails for any modern SPA (Angular, React, etc.). In the second case, Phantom is painfully slow and difficult to interact with, and often doesn't run/render an SPA as a regular browser does.
You can write tests around whether your selectors are returning data, but even simple refactors from a dev team quickly break your selector profiles multiple times a week or month.
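Such a check could look roughly like this. It's only a sketch: `profile`, `findBrokenSelectors` and `extract` are hypothetical names, and `extract` stands in for whatever actually runs the selector (jsdom, a headless browser, etc.). The stub below is there only so the example runs on its own.

```javascript
// profile: map of field name -> CSS selector (hypothetical shape).
// extract: function (html, selector) -> array of matched strings, supplied by the caller.
function findBrokenSelectors(profile, html, extract) {
  return Object.keys(profile).filter(function (field) {
    const results = extract(html, profile[field]);
    // A selector counts as broken when it returns nothing at all.
    return !Array.isArray(results) || results.length === 0;
  });
}

// Stub extractor for illustration only: pretends that only "h1" matches anything.
function stubExtract(html, selector) {
  return selector === "h1" ? ["Hello"] : [];
}

const broken = findBrokenSelectors(
  { title: "h1", price: ".price-v2" },
  "<h1>Hello</h1>",
  stubExtract
);
console.log(broken);
```

This only tells you that a selector went silent, not that it now matches the wrong element, which is part of why these profiles stay fragile.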
Kimono Labs used to do something similar, but shut down recently. They had a nice point-and-click interface that allowed you to build the selectors by clicking on elements, with an immediate preview. They also handled things like pagination, etc.
> I'm really surprised nothing like this has existed before
But how would you monetize it?
Unlike an RSS feed, you really don't know how the JSON response would be used, so you can't inject ads into it.
And if you charge for it, wouldn't people assume it would continue to work? Site "scrapers", regardless of how they are configured, are likely to break, so it would be tougher to have customers pay for something that could break at any time, leaving them to figure out whether it's the service or the page that has changed or broken.
Don't get me wrong- some great businesses have been/are based on "scraping" in one way or another. However, as cool as this is, it's just another way to "scrape". If the person hosting the page would provide an API or JSON view instead, you'd be loads better off.
Freemium, professional support, expanding it into an abstraction layer above the APIs for multiple services, selling a version that larger companies can run on their own servers which they might need for data security...
>However, as cool as this is, it's just another way to "scrape"
Isn't that the point? The demo seems like it'd be a lot easier, less verbose, and probably less brittle than using cURL/XPath or otherwise parsing that HTML yourself.
We launched WrapAPI (https://wrapapi.com/) a few weeks ago with the same functionality, but a bit more complex and powerful process to get set up. You can not only specify CSS selectors yourself but define them point and click.
The barrier for starting with JamAPI is impressively low, though! Kudos on the developer-friendly user interface.
I put this similar project[0] together a while ago. Almost the same concept, but I skipped the json layer altogether as I just wanted a quick way of getting nuggets of content from webpages into my terminal.
Incidentally, you don't really need to have that "index" key inside the values of an array, because in an array the order is preserved anyway. Unless I've misunderstood what it means?
Regarding the "index" key, there are some JSON parsers for languages like Swift that will rearrange your JSON. By adding the index key, you'll still be able to sort after parsing.
Also, thanks, it's really cool to see people liking this :)
They might rearrange keys in a JSON object, but in an array the order should be preserved according to the spec[1]. If Swift does this (which I can't really check), then this would be a bug.
[1] http://www.json.org/: An array is an ordered collection of values. An array begins with [ (left bracket) and ends with ] (right bracket). Values are separated by , (comma).
Yes, the order of elements in an array should always be preserved. For example, we might be expecting the first element to be a name, the second to be a date of birth, etc. We should use an object for that, but that's for reasons of readability, extensibility, etc. rather than array semantics being unsuitable.
Also, jq has a `--sort-keys` option which tries to make the output as reproducible/canonical as possible. From the manual:
> The keys are sorted "alphabetically", by unicode codepoint order. This is not an order that makes particular sense in any particular language, but you can count on it being the same for any two objects with the same set of keys, regardless of locale settings.
It would be strange for a JSON tool to go to such lengths to normalise data, if array order were unpredictable.
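A small sketch of both points in JavaScript. The field names (`index`, `value`) are assumed for illustration, and `canonicalize` is a hypothetical helper that imitates the effect of key sorting; it is not jq's implementation.

```javascript
// Arrays keep their order through a parse, so an explicit "index" is redundant.
const items = JSON.parse('[{"index":0,"value":"a"},{"index":1,"value":"b"}]');
const orderKept = items.every(function (item, i) { return item.index === i; });

// Recursively rebuild objects with sorted keys; arrays are left in their original order.
// Note: the default sort compares UTF-16 code units, which is close to, but not
// exactly, the codepoint order the jq manual describes.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const out = {};
    Object.keys(value).sort().forEach(function (k) { out[k] = canonicalize(value[k]); });
    return out;
  }
  return value;
}

const first = JSON.stringify(canonicalize({ b: 1, a: [3, 2, 1] }));
const second = JSON.stringify(canonicalize({ a: [3, 2, 1], b: 1 }));
console.log(orderKept, first === second, first);
```

The two objects serialize identically once their keys are sorted, while the array `[3, 2, 1]` is deliberately left alone.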
Very nice idea. Although scraping should always be a last resort, I could imagine using this for semi-serious purposes, i.e. when I care enough about the output, will be doing many requests, don't mind relaying data via a third-party, etc.
I currently do quite a bit of scraping for my own use (generating RSS feeds for sites, making simple commandline interfaces to automate common tasks, etc.). I've found xidel to be pretty good for this: it starts off pretty simple (e.g. with CSS selectors or XPath), but gets pretty gnarly for semi-complicated things. For example, it allows templating the output, using a language I struggle to grasp. This service seems to address that middle ground, e.g. restricting its output to JSON, and hence making the specification of the output much simpler (a nice JSON structure, rather than messing around with splicing text together).
Great! I've been trying to get my head around Scrapy, and I have little Python experience. This seems to fit in a lot better with my skillset for the project I'm working on.
I'm using Apifier at the moment, which I really like, but my biggest gripe is the awkwardness of source (and VCS) integration. The best I've come up with is to export the JSON config (which contains the scraper source code as a value - yuck) and try to remember to keep re-exporting and checking it in.
Having also had to hack around the inability to parameterise the scrape url (e.g. 'profile/$username') - which they've since added support for - I started to wonder if I mightn't as well just use BeautifulSoup (Python HTML parser lib) and check it in properly.
This is probably my ideal. I can keep it all in source control because it's just an HTTP request body, and I can parameterise it because, well, it's just an HTTP request body!
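Something along these lines, as a sketch. The `url` and `json_data` field names are my assumption about the request format, and the profile URL and selectors are made up; `buildScrapeRequest` is a hypothetical helper.

```javascript
// Build the request body for a given username; this function can live in source control.
function buildScrapeRequest(username) {
  return {
    // Assumed target page layout: 'profile/$username'.
    url: "https://example.com/profile/" + encodeURIComponent(username),
    // Assumed format: a JSON string mapping output keys to CSS selectors.
    json_data: JSON.stringify({
      name: "h1.profile-name",
      bio: ".profile-bio"
    })
  };
}

const req = buildScrapeRequest("alice smith");
console.log(req.url);
```

Since the whole scraper definition is plain data returned by a function, diffs in version control show exactly which selector changed.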
It's also open source because you're an amazing person; so if I had one little concern left about the availability of your site I can dismiss it right away since I could run my own on Heroku should jamapi.xyz prove unsustainable. It's possibly a better idea to do that anyway, but I often wonder if Heroku doesn't consider that a problem - multiple instances of the same app running on free dynos under different accounts...
I think that with the advent of tools like this, developers will increasingly look for ways to make it hard to scrape their websites into data structures. I wonder if what happened to minified JS will happen more and more to HTML. I know there are sites that dynamically change CSS class names and IDs. But I think we will soon also see div hierarchies that change form dynamically without looking any different to the end user.
I had been trying to figure out what would be causing this issue, thanks for pointing it out, I've pushed a fix real quick that will respond whether JSON is invalid or a CSS selector wasn't found on the provided URL.
Does anyone have any information on anyone who has used HTTP as an API to share/create metadata for any transactions, content, etc. publicly online? I would be very curious to know about it!
OT perhaps: I'm still looking for a solution that has a graphical UI that allows users to point and click an element on their page and returns the corresponding CSS selector. SelectorGadget does this as a Chrome extension, but I'm looking for something that works without an extension.
AFAIK, SelectorGadget's Chrome extension is just a wrapper around the bookmarklet. It's pure JS, doesn't use any sort of elevated privileges, and is MIT licensed, so you can include the core engine in your own projects.
You'd need to either re-implement an entire browser stack or run a headless version of Gecko or WebKit server-side.
The former entails millions of man-hours of work. The latter opens up your server to all sorts of exploits. Overall a really bad idea.
Besides, single page applications are the worst junk in the entire Web 2.0 cesspool. If you really need to scrape them, they usually come with their own JSON API which you can just piggyback.
Why on Earth would the OP start from scratch? Besides, though not a solo and OSS effort, Apifier does this; certainly without "millions" of hours having been spent on it.
If anyone remembers, there was a YC company that did exactly this. It was called Kimono Labs. I think it failed and just got acquired a year ago. "Jam API" will probably do way better because, well, open source.
I've been thinking about writing some website-to-JSON scrapers myself and this basically solves that problem (since I would have been going after CSS selectors or xpath anyway myself). Nice job.
CloudFlare will make sure the browser can run JS, which in the case of this service I assume it won't. There are ways around this of course, using headless browsers (e.g. PhantomJS), tools like cloudflare-scrape[0] (which uses PyExecJS[1]). I've even used PyQt5 to render webpages for similar purposes.
Unfortunately the aforementioned tools are generally pretty slow (especially headless browsers). Also I can't imagine it's particularly safe running such a service.
I wrote a language that's basically a superset of this (https://github.com/fizx/parsley/wiki) back in 2008 and used it to crawl a variety of insane job posting sites.
As crawling complexity increases, pretty soon you want an actual programming language to specify things like crawl order and cache behavior. Multi-page behavior was very hard to describe declaratively for misbehaving sites.
Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Such as it is, I believe that the following works in some version of parsley, though I doubt it's an official release.
> Also, it's a terrible default (for security reasons) to let the web pages you're parsing automagically initiate new requests to arbitrary urls.
Right. We'd have to grab only the article-id, validate that it is in fact an integer in the right range, and only then piece the URL back together and request it.
On the other hand, maybe just checking that we stay within the domain is enough. If the website wants to screw with us, they can send us any reply they want to any url anyway.
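A minimal sketch of both checks, assuming a hypothetical URL layout (`/articles/<id>`), an arbitrary ID range, and example.com as the site being crawled. None of these names come from parsley or any real crawler.

```javascript
const BASE = "https://example.com";   // assumed site being crawled
const MAX_ARTICLE_ID = 10000000;      // arbitrary upper bound for illustration

// Option 1: keep only the article id, validate it, then rebuild the URL ourselves.
function articleUrlFromId(rawId) {
  if (!/^\d+$/.test(String(rawId))) return null;   // digits only
  const id = Number(rawId);
  if (id < 1 || id > MAX_ARTICLE_ID) return null;  // must be in the expected range
  return BASE + "/articles/" + id;
}

// Option 2: accept a scraped link only if it resolves to the same host over https.
function sameDomainUrl(href) {
  let u;
  try { u = new URL(href, BASE); } catch (e) { return null; }
  const base = new URL(BASE);
  return u.protocol === "https:" && u.host === base.host ? u.href : null;
}

console.log(articleUrlFromId("42"), sameDomainUrl("/articles/42"), sameDomainUrl("https://evil.test/x"));
```

The first option never requests a URL the page supplied; the second trusts the page for the path but not for the host.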
"At some point, these json things might as well be as readable as regex :/"
Don't feel :/ . The complexity is essential, and located in the remote website, not your code or your ideas. You still win isolating all the nasty stuff to one and precisely one location. :/ is on them, not you!
Have fun and screw the haters...