Hacker News
An open source API for web scraping (github.com)
19 points by owainlewis 4009 days ago
6 comments

An example showing how to grab all the stories from the Hacker News homepage

https://falkor-api.herokuapp.com/api/query?url=https://news....

Title should probably contain 'Show HN:' ?

Very interesting though. Just tried scraping Twitter and it works great: https://falkor-api.herokuapp.com/api/query?url=https://twitt...

Edit: works great as long as there are no quotes, hashtags, or links in the tweets. Is it possible to include sub-elements?

So basically this is a DOM API in JSON. Simple, but I like it.

Any plans to add JSONP support?

Hey. Thanks. Yeah, I will add a ton of features over the next few days. JSONP should be an easy one. Feel free to add an issue on GitHub and I'll get it done for you.
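For anyone wondering what JSONP support amounts to on the server side, here is a minimal sketch in Python. It is not Falkor's implementation: `to_jsonp` and the callback rule are hypothetical, and the only assumption is that the API already has a JSON-serializable result in hand.

```python
import json
import re

# Hypothetical rule: accept only plain (optionally dotted) JavaScript
# identifiers, so a caller cannot inject arbitrary script via the callback.
_CALLBACK_RE = re.compile(
    r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*"
)


def to_jsonp(payload, callback):
    """Wrap a JSON-serializable payload in a call to `callback`."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name: %r" % callback)
    return "%s(%s);" % (callback, json.dumps(payload))


if __name__ == "__main__":
    print(to_jsonp({"tag": "a", "text": "Example"}, "handleResult"))
```

The response would then be served as JavaScript rather than JSON, and the browser calls the named function with the data.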

I only really started hacking around on the idea the other day, so it's early stages. I want to add filters so you can say "grab me only the text" or "grab me just the class names". Obviously another step would be to grab multiple elements in one request.
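A "text only" filter that also picks up text inside sub-elements (links, hashtags and so on) could look roughly like the sketch below. It uses only Python's standard library; `TextOnly` and `text_only` are made-up names, and matching on a bare tag name is a simplification of whatever selector the API actually takes.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text of every <tag> element, nested children included."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # how many matching elements we are inside
        self.results = []    # one string per outermost matching element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        # Text inside sub-elements arrives here too, so it is kept.
        if self.depth > 0:
            self._buf.append(data)


def text_only(html, tag):
    parser = TextOnly(tag)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling `text_only(page_html, "p")` returns one string per paragraph, with link and hashtag text left in place.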

A better error message would be helpful. For example, I tried https://falkor-api.herokuapp.com/api/query?url=https://kodin... and all I got was "Request failed".

That's a good point. I pretty much wrote this in an evening or two so haven't had time to refine it much. But yeah, error messages will definitely be improved. It's because of the way URLs are handled in the underlying web app. Will be an easy fix.

Cool idea. This could easily be extended to support something like a proxy pool; that way you can rate limit / rotate proxies for X domain globally at the server level. It then applies across all your projects, rather than having to be done on a per-project basis.
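As a rough illustration of the proxy pool idea, the sketch below keeps one shared object that hands out the next proxy and says how long to wait before hitting a given domain again. Everything here is hypothetical (`DomainThrottle`, the proxy names, the fixed minimum interval); a real pool would also need health checks and locking.

```python
import itertools
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Rotate proxies round-robin and space out requests per domain."""

    def __init__(self, proxies, min_interval, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = itertools.cycle(proxies)
        self._min_interval = min_interval
        self._clock = clock
        self._last_slot = {}  # domain -> time of the last scheduled request

    def acquire(self, url):
        """Return (proxy, seconds_to_wait) for the next request to `url`."""
        domain = urlsplit(url).hostname or ""
        now = self._clock()
        # A domain never seen before is treated as ready immediately.
        last = self._last_slot.get(domain, now - self._min_interval)
        wait = max(0.0, last + self._min_interval - now)
        self._last_slot[domain] = now + wait
        return next(self._proxies), wait
```

The caller sleeps for the returned wait and then sends the request through the returned proxy, so every project going through the server shares the same limits.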

Adding XPath support alongside CSS selectors would be a good addition.
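To show what an XPath query buys you, here is a small sketch using Python's standard library. The markup and the `story_links` helper are invented for illustration. ElementTree supports only a subset of XPath and needs well-formed XML, so real-world HTML would likely call for a more lenient parser.

```python
import xml.etree.ElementTree as ET


def story_links(markup):
    """Return the href of every <a> directly inside a div with class 'story'."""
    root = ET.fromstring(markup)
    # XPath: any <div class="story"> at any depth, then its <a> children.
    return [a.get("href") for a in root.findall(".//div[@class='story']/a")]


SAMPLE = (
    "<html><body>"
    "<div class='story'><a href='/a'>A</a></div>"
    "<div><a href='/b'>B</a></div>"
    "</body></html>"
)

if __name__ == "__main__":
    print(story_links(SAMPLE))
```

The attribute predicate and the parent/child step are the parts that are awkward or impossible to express with a simple tag-name query.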

Will definitely do something with caching and rate limiting when I get some time. These queries are quite expensive so definitely needs a bit of work in those areas.
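Since each query means fetching and parsing a remote page, even a small time-based cache would cut down on repeated work. This is only a sketch: `TTLCache` is a made-up name, and keying on the URL plus selector is an assumption about how requests would be identified.

```python
import time


class TTLCache:
    """Cache computed values for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, calling `compute` on a miss."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A request handler would call `cache.get_or_compute((url, selector), fetch_and_parse)` so that identical queries inside the TTL window hit the target site only once.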

An example query that extracts all the images from the Digg.com homepage:

https://falkor-api.herokuapp.com/api/query?url=http://digg.c...

Pretty interesting. Last year I wrote a web scraping API you could paste into your browser to download results, but I took it down to work on another project. You can take a look at what a URL could look like.

https://web.archive.org/web/20140420162639/http://scrape.ly/

For example, if you wanted the profiles of the authors of today's stories:

    http://scrape.ly/s/{http://news.combination.com}
    {'ueoma87'}*{'next':'Next Page'}{'karma':'331', 
    'username':'ueoma87'}

This would have returned the profile of each story's author for today, yesterday, and so on.

Thanks. This looks really interesting. I may well borrow some ideas ; )