| HN Mirror



	by thejosh 4446 days ago
	I prefer to rely on code that doesn't rely on an API that could just vanish the next day or cost a bucket to run.

2 comments

zmmmmm 4446 days ago

> rely on an API that could just vanish the next day

Kind of ironic that you are saying this about web scraping ...


jimktrains2 4446 days ago

But if the site vanishes, his data source is gone and what he was doing is pointless anyway. Losing the tool that processes the data while the data source itself is still available is what's frustrating.


ejstronge 4446 days ago

What do you use for scraping? I may have a scraping project later this year and would love recommendations.


PuerkitoBio 4446 days ago

I've written a couple of "polite" crawlers in Go (i.e. they obey robots.txt and delay between requests to the same host).

- Fetchbot: https://github.com/PuerkitoBio/fetchbot

Flexible, similar API to net/http (uses a Handler interface with a simple mux provided, supports middleware, etc.)

- gocrawl: https://github.com/PuerkitoBio/gocrawl

Higher-level, more framework than library.

Coupled with goquery (https://github.com/PuerkitoBio/goquery) to scrape the DOM (well, the net/html nodes), this makes custom scrapers trivial to write.

(sorry for the self-promoting comment, but this is quite on topic)

edit: polite crawlers, not scrapers.
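
For anyone wondering what "polite" means in practice, here is a rough sketch of the two rules mentioned above (obey robots.txt, delay between requests to the same host). It is written in Python with only the standard library and is not how Fetchbot or gocrawl are implemented. The class name PolitenessGate, its methods, and the "examplebot" user agent are all made up for illustration; the robots.txt text is assumed to have been fetched already.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: tracks robots.txt rules and request timing per host."""

    def __init__(self, user_agent, delay_seconds=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.delay_seconds = delay_seconds
        self.clock = clock          # injectable so the timing can be tested
        self.rules = {}             # host -> RobotFileParser
        self.next_allowed = {}      # host -> earliest time of the next request

    def load_robots(self, host, robots_txt):
        """Parse robots.txt text that was already fetched for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True if the loaded robots.txt rules permit fetching this URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        # Assumption of this sketch: no rules loaded means "allowed".
        return parser is None or parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """How long to sleep before the next request to this URL's host."""
        host = urlsplit(url).netloc
        if host not in self.next_allowed:
            return 0.0
        return max(0.0, self.next_allowed[host] - self.clock())

    def mark_fetched(self, url):
        """Record a request so later ones to the same host are delayed."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = self.clock() + self.delay_seconds
```

A crawler loop would check allowed(url), sleep for seconds_to_wait(url), make the request, then call mark_fetched(url). Requests to different hosts never wait on each other.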


djm_ 4446 days ago

Scrapy gets a solid recommendation from me. http://scrapy.org/


frabcus 4446 days ago

We've got quite an old mailing list full of geeks hand-coding web scrapers, if you want somewhere to ask questions:

https://groups.google.com/forum/#!forum/scraperwiki


egeozcan 4446 days ago

I use custom node.js scripts with these libraries:

* request - https://github.com/mikeal/request

* async - https://github.com/caolan/async

* cheerio - https://github.com/cheeriojs/cheerio

* nedb - https://github.com/louischatriot/nedb
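
To give a feel for the parsing step (the role cheerio plays in the list above), here is a small sketch that pulls links out of an HTML string. It is Python with the standard library's html.parser, not the node.js stack listed, and it covers neither fetching nor storage. LinkExtractor and scrape are illustrative names, not part of any of those libraries.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


def scrape(html):
    """Return a list of {"href", "text"} dicts for the links in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling scrape(page_source) on a downloaded page yields plain dicts, which can then be handed to whatever storage layer the project uses.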
