An open source API for web scraping

	An open source API for web scraping (github.com)
	19 points by owainlewis 4009 days ago

6 comments

owainlewis 4009 days ago

An example showing how to grab all the stories from the Hacker News homepage:

https://falkor-api.herokuapp.com/api/query?url=https://news....

_jomo 4009 days ago

Title should probably contain 'Show HN:' ?

Very interesting though. Just tried scraping Twitter and it works great: https://falkor-api.herokuapp.com/api/query?url=https://twitt...

Edit: works great as long as there are no quotes, hashtags, or links in the tweets. Is it possible to include sub-elements?

So basically this is a DOM API in JSON. Simple, but I like it.

Any plans to add JSONP support?

owainlewis 4009 days ago

Hey. Thanks. Yeah, I will add a ton of features over the next few days. JSONP should be an easy one. Feel free to add an issue on GitHub and I'll get it done for you.
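
For context, a minimal sketch of what JSONP support usually amounts to on the server side: the JSON body is wrapped in a call to a function name supplied by the caller. The helper name `to_jsonp` and the `callback` argument are assumptions for illustration and not part of the actual API.

```python
import json
import re

# Only allow plain (optionally dotted) JavaScript identifiers as callback
# names, so a caller cannot inject arbitrary script into the response.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def to_jsonp(payload, callback):
    """Wrap a JSON-serialisable payload in a JSONP callback invocation."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError(f"invalid callback name: {callback!r}")
    return f"{callback}({json.dumps(payload)});"
```

The response would then be served as JavaScript rather than JSON, and validating the callback name is what keeps the endpoint from echoing attacker-controlled script.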

Only really started hacking around on the idea the other day so early stages. Want to add filters so you can say "grab me only the text" or "grab me just the class names". Obviously another step would be to grab multiple elements in one request.

getriver 4009 days ago

A better error message would be helpful. For example, I tried https://falkor-api.herokuapp.com/api/query?url=https://kodin... and all I got was "Request failed".

owainlewis 4008 days ago

That's a good point. I pretty much wrote this in an evening or two, so I haven't had time to refine it much. But yeah, error messages will definitely be improved. It's because of the way URLs are handled in the underlying web app. Will be an easy fix.
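
One plausible cause, stated as an assumption since the failing URL above is truncated, is a target URL that is not percent-encoded, so its own `?` and `&` get read as part of the API request. A minimal client-side sketch: only the endpoint and the `url` parameter come from the examples in this thread, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example links in this thread.
API_ENDPOINT = "https://falkor-api.herokuapp.com/api/query"


def build_query_url(target_url):
    """Return an API request URL with target_url safely percent-encoded."""
    # urlencode escapes ':', '/', '?', '=' and '&' in the value, so the
    # target's own query string cannot be mistaken for API parameters.
    return f"{API_ENDPOINT}?{urlencode({'url': target_url})}"
```

Any further parameters the API accepts would be added to the same dictionary so they are encoded the same way.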

Jake232 4009 days ago

Cool idea. This could easily be extended to support something like a proxy pool, so you can rate limit and rotate proxies per domain globally at the server level. That way it applies across all your projects, rather than having to be set up on a per-project basis.
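
A rough sketch of that idea, assuming simple round-robin rotation and a fixed minimum interval per domain. The `ProxyPool` class and its method names are hypothetical, and the proxy strings are placeholders.

```python
import itertools
import time
from urllib.parse import urlsplit


class ProxyPool:
    """Rotate proxies round-robin and enforce a per-domain minimum interval."""

    def __init__(self, proxies, min_interval=1.0, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval
        self._clock = clock
        self._next_free = {}  # domain -> earliest time the next request may start

    def acquire(self, url):
        """Return (proxy, wait_seconds) for the next request to url's domain."""
        domain = urlsplit(url).hostname or ""
        now = self._clock()
        start = max(now, self._next_free.get(domain, now))
        # Reserve the slot so back-to-back callers for the same domain
        # queue up behind each other instead of all starting at once.
        self._next_free[domain] = start + self._min_interval
        return next(self._cycle), start - now
```

The caller is expected to sleep for `wait_seconds` before sending the request through the returned proxy. Keeping this state in one server is what makes the limit hold across every project that uses it.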

Supporting XPath alongside CSS selectors would be a good addition.
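
To illustrate what an XPath query could look like, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. The sample markup and the `select` helper are made up for the example; the CSS equivalent of the query shown would be `a.story`.

```python
import xml.etree.ElementTree as ET

# ElementTree needs well-formed markup, so this sample is XHTML-style;
# real-world HTML would first need a more tolerant parser.
SAMPLE = """
<html><body>
  <a class="story" href="/item1">First</a>
  <a class="nav" href="/next">More</a>
  <a class="story" href="/item2">Second</a>
</body></html>
""".strip()


def select(markup, xpath):
    """Return (text, attributes) for every element matching xpath."""
    root = ET.fromstring(markup)
    return [(el.text, dict(el.attrib)) for el in root.iterfind(xpath)]


stories = select(SAMPLE, ".//a[@class='story']")
```

Here `stories` holds the two `story` links and skips the `nav` one.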

owainlewis 4009 days ago

Will definitely do something with caching and rate limiting when I get some time. These queries are quite expensive, so it definitely needs a bit of work in those areas.
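
As a sketch of the caching half, assuming results can be reused for a fixed time window: a small time-to-live cache keyed by the query. The `TTLCache` class and `get_or_compute` method are hypothetical names, not something from the project.

```python
import time


class TTLCache:
    """Cache results per key and expire them after ttl seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # Miss or expired entry: run the expensive query and store the result.
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A natural key would be the full set of query parameters, so repeated requests for the same page and selector skip the expensive fetch until the entry expires.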

owainlewis 4009 days ago

An example query that extracts all the images from the Digg.com homepage.

https://falkor-api.herokuapp.com/api/query?url=http://digg.c...

curiously 4009 days ago

Pretty interesting. Last year I wrote a web scraping API where you could paste a URL into your browser and download the results, but I took it down to work on another project. You can take a look at what a URL could look like.

https://web.archive.org/web/20140420162639/http://scrape.ly/

For example, if you wanted the profiles of the authors of today's stories:

    http://scrape.ly/s/{http://news.combination.com}
    {'ueoma87'}*{'next':'Next Page'}{'karma':'331', 
    'username':'ueoma87'}

That would have returned the profile of each story's author for today, yesterday, and so on.

owainlewis 4008 days ago

Thanks. This looks really interesting. I may well borrow some ideas ; )
