Hacker News new | ask | show | jobs
by wnm 4163 days ago
I recommend having a look at capybara [0]. It is build on top of nokogiri, and is actually a tool to write acceptence tests. But it can also be used for web scraping: you can open websites, click on links, fill in forms, find elements on a page (via xpath or css), get their values, etc... I prefer it over nokogiri because of its nice DSL and good documentation [1]. It also can execute javascript, which sometimes is handy for scraping.

I've spend a lot of time working on web scrapers for two of my projects, http://themescroller.com (dead) and http://www.remoteworknewsletter.com, and I think the holy grail is to build a rails app around your scraper. You can write your scrapers as libs, and then make them executable as rake tasks, or even cronjobs. And because its a rails app you can save all scraped data as actual models and have them persisted in a database. With rails its also super easy to build an api around your data, or build a quick backend for it via rails scaffolds.

[0] https://github.com/jnicklas/capybara [1] http://www.rubydoc.info/github/jnicklas/capybara/