by ycmike 4446 days ago
HN,

So who do you guys use more? Import.io or Kimono? I have heard good things about both.

4 comments

I write my own custom scrapers. I prefer the flexibility, and I feel safer knowing the service isn't going to disappear any minute.

If anybody is interested, I wrote a detailed article on scraping not so long back that was well received here: http://jakeaustwick.me/python-web-scraping-resource/

I tried Kimono, but it cannot auth into the sites I want to pull the data from....

Just grabbed import.io - will see if it can log into sites and grab the data from services I am already paying thousands per month for.

EDIT:

To add some context: I pay about $3,000 per month for some monitoring services which do not have any real reporting mechanisms. So for my daily and weekly reports, I have to manually compile them, screenshot a ton of things, compose an email and send it.

I want to configure a scraper to automatically grab screens of things I want regularly and email them.

I want to have a script that will grab many different pieces of data (visual graphs, typically) and put them all into one email.

I am working with my monitoring vendors to get them to add reporting... but until that can happen - I am tired of spending a couple hours per week screen capping graphs...
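For what it's worth, the "put them all into one email" half can be done with nothing but the Python standard library. Below is a rough sketch, assuming the graph screenshots have already been captured as PNG bytes by some other tool (a headless browser, for example). The names build_report and send_report, the addresses, and the SMTP host are all made up for illustration, not part of any service mentioned here.

```python
import smtplib
from email.message import EmailMessage


def build_report(sender, recipient, subject, images):
    """Bundle several screenshots into one email.

    images is a list of (filename, png_bytes) tuples; capturing the
    screenshots is assumed to happen elsewhere.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(f"{len(images)} graphs attached.")
    for filename, data in images:
        # Each graph becomes its own image/png attachment.
        msg.add_attachment(data, maintype="image", subtype="png",
                           filename=filename)
    return msg


def send_report(msg, host="localhost", port=25):
    # Hypothetical SMTP relay; replace host/port with your own server.
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Run build_report from cron (daily or weekly) with whatever screenshots were captured, then hand the result to send_report.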

I'm evaluating these to augment a system I'm building on top of casper. This is the first I've seen of this one, but right out of the gate I think I prefer Kimono.
I prefer to rely on code that doesn't rely on an API that could just vanish the next day or cost a bucket to run.
> rely on an API that could just vanish the next day

Kind of ironic that you are saying this about web scraping ...

But then his data source is gone and what he was doing is pointless. Losing your processor of said data source while said data source is still available is frustrating.
What do you use for scraping? I may have a scraping project later this year and would love recommendations.
I've written a couple "polite" crawlers in Go (i.e. obeys robots.txt, delays between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery ) to scrape the dom (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.
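To illustrate what "polite" means above (obey robots.txt, delay between requests to the same host), here is a minimal sketch in Python using only the standard library. It is not how Fetchbot or gocrawl are implemented; PoliteGate and its methods are hypothetical names, and treating a host with no loaded robots.txt as allowed is an assumption made for this sketch.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Tracks robots.txt rules and a per-host delay (illustrative only)."""

    def __init__(self, user_agent, delay_seconds=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.delay = delay_seconds
        self.clock = clock        # injectable so the delay can be tested
        self.sleep = sleep
        self.robots = {}          # host -> RobotFileParser
        self.last_hit = {}        # host -> time of the last request

    def set_robots(self, host, robots_txt):
        # Parse robots.txt text that was fetched separately.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        host = urlparse(url).netloc
        parser = self.robots.get(host)
        # Assumption: no robots.txt loaded for the host means "allowed".
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_turn(self, url):
        """Sleep until the host's delay has elapsed; return seconds waited."""
        host = urlparse(url).netloc
        last = self.last_hit.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.delay - (self.clock() - last))
            if waited:
                self.sleep(waited)
        self.last_hit[host] = self.clock()
        return waited
```

A crawler loop would call allowed(url) before queuing a URL and wait_turn(url) right before each request, so different hosts never block each other.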

Scrapy gets a solid recommendation from me. http://scrapy.org/
We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki

I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb