HN Mirror



	by ycmike 4446 days ago
	HN, so which do you guys use more: Import.io or Kimono? I have heard good things about both.

4 comments

Jake232 4446 days ago

I write my own custom scrapers. I prefer the flexibility, and I feel safer knowing the service isn't going to disappear any minute.

If anybody is interested, I wrote a detailed article on scraping not so long back that was well received here: http://jakeaustwick.me/python-web-scraping-resource/


samstave 4445 days ago

I tried Kimono, but it cannot authenticate into the sites I want to pull data from...

Just grabbed import.io - will see if it can log into sites and grab the data from services I am already paying thousands per month for.

EDIT:

To add some context: I pay about $3,000 per month for some monitoring services which do not have any real reporting mechanisms. So for my daily and weekly reports, I have to manually compile them, screenshot a ton of things, compose an email and send it.

I want to configure a scraper to automatically grab screens of things I want regularly and email them.

I want to have a script that will grab many different pieces of data (visual graphs, typically) and put them all into one email.

I am working with my monitoring vendors to get them to add reporting... but until that can happen - I am tired of spending a couple hours per week screen capping graphs...
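As a rough illustration of the "many graphs, one email" part, here is a minimal Python sketch using only the standard library. It assumes the screenshots have already been captured by some other tool (for example a headless browser) and are available as PNG bytes; `build_report`, the addresses, and the filenames are hypothetical names made up for this example, not part of any service mentioned above.

```python
from email.message import EmailMessage


def build_report(sender, recipient, subject, screenshots):
    """Bundle several PNG screenshots into a single email message.

    `screenshots` maps a filename to raw PNG bytes captured elsewhere
    (the capture step itself is assumed and not shown here).
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text body listing what is attached.
    msg.set_content(
        "Attached graphs:\n" + "\n".join(f"- {name}" for name in screenshots)
    )
    # One attachment per screenshot, all in the same message.
    for name, png_bytes in screenshots.items():
        msg.add_attachment(
            png_bytes, maintype="image", subtype="png", filename=name
        )
    return msg
```

The resulting message could then be handed to `smtplib.SMTP(...).send_message(msg)` from a cron job; the SMTP host and credentials depend entirely on your own mail setup.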


rch 4446 days ago

I'm evaluating these to augment a system I'm building on top of casper. This is the first I've seen of this one, but right out of the gate I think I prefer Kimono.


thejosh 4446 days ago

I prefer to rely on code that doesn't rely on an API that could just vanish the next day or cost a bucket to run.


zmmmmm 4446 days ago

> rely on an API that could just vanish the next day

Kind of ironic that you are saying this about web scraping ...


jimktrains2 4446 days ago

But then his data source is gone and what he was doing is pointless. Losing your processor of said data source while said data source is still available is frustrating.


ejstronge 4446 days ago

What do you use for scraping? I may have a scraping project later this year and would love recommendations.


PuerkitoBio 4446 days ago

I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.
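For readers unfamiliar with what "polite" means in practice, the two rules mentioned above (obey robots.txt, delay between requests to the same host) can be sketched in a few lines. This is a Python standard-library illustration, not how Fetchbot or gocrawl are implemented; `PolitenessGate`, its method names, and the default delay are assumptions made up for the example.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class PolitenessGate:
    """Decides whether and when a URL may be fetched (hypothetical helper)."""

    def __init__(self, user_agent, min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between hits to one host
        self.clock = clock          # injectable so the logic is testable
        self._robots = {}           # host -> RobotFileParser
        self._last_hit = {}         # host -> time of last recorded fetch

    def load_robots(self, host, robots_txt):
        """Register a robots.txt body that was already downloaded for host."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        """True if robots.txt permits this URL (or none was loaded)."""
        parser = self._robots.get(urlsplit(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to sleep before hitting this URL's host again."""
        last = self._last_hit.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def mark_fetched(self, url):
        """Record that a request to this URL's host was just made."""
        self._last_hit[urlsplit(url).netloc] = self.clock()
```

A crawl loop would check `allowed(url)`, sleep for `wait_time(url)`, fetch, and then call `mark_fetched(url)`. Keeping the delay per host means requests to different hosts do not have to wait on each other.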


djm_ 4446 days ago

Scrapy gets a solid recommendation from me. http://scrapy.org/


frabcus 4446 days ago

We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki


egeozcan 4446 days ago

I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb
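The libraries above cover fetching, concurrency, HTML parsing, and local storage. To show the parse-and-store half of that pipeline, here is a rough sketch that swaps in Python's standard library (`html.parser` and `sqlite3`) rather than the Node.js packages listed; `LinkExtractor`, `store_links`, and the `links` table layout are hypothetical names chosen for the example, and the HTML is assumed to have been fetched already.

```python
import sqlite3
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def store_links(db, page_url, html):
    """Parse one page and persist its links; returns how many were saved."""
    extractor = LinkExtractor()
    extractor.feed(html)
    extractor.close()
    db.execute(
        "CREATE TABLE IF NOT EXISTS links (page TEXT, href TEXT, text TEXT)"
    )
    db.executemany(
        "INSERT INTO links VALUES (?, ?, ?)",
        [(page_url, href, text) for href, text in extractor.links],
    )
    db.commit()
    return len(extractor.links)
```

Calling `store_links(sqlite3.connect("scrape.db"), url, html)` for each fetched page would accumulate results in one local file, which plays roughly the role nedb has in the stack above.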
