Hacker News
by thejosh 4446 days ago
I prefer to rely on code that doesn't rely on an API that could just vanish the next day or cost a bucket to run.

> rely on an API that could just vanish the next day

Kind of ironic that you are saying this about web scraping ...

But then his data source is gone and what he was doing is pointless anyway. Losing the processor for a data source while the source itself is still available is what's frustrating.
What do you use for scraping? I may have a scraping project later this year and would love recommendations.
I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.

Scrapy gets a solid recommendation from me. http://scrapy.org/
We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki

I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb