by SIK 5680 days ago
I recently did something similar, using Anemone to crawl the website and Hpricot to scrape each page and add the results to a database.

Anemone is great because it can restrict your crawl to URLs that match a certain pattern, which really helps when you only want to traverse a small portion of a larger website (like a university site). You can also run specific actions on pages that match a certain pattern.

For scraping, Anemone natively supports Nokogiri, so since you're starting from a blank slate, it might be easiest to learn Nokogiri. Before discovering Anemone, I had already written the per-page logic in Hpricot, so my code is a bit messy, but it's not that difficult to get Anemone and Hpricot to work together.
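
To make the pattern-focused crawl concrete, here is a rough sketch of how it might look. The start URL, the /courses/ pattern, and the helper names links_to_follow and crawl_courses are all made up for illustration; only focus_crawl, on_pages_like, page.links, page.url and page.doc (a Nokogiri document) come from Anemone itself. Treat it as a starting point, not as the code I actually ran.

```ruby
require "uri"

# Hypothetical pattern: only follow and process URLs under /courses/.
COURSE_PATTERN = %r{/courses/}

# Keep only the links whose path matches the pattern.
# `links` is an array of URI objects, which is what Anemone's page.links returns.
def links_to_follow(links, pattern = COURSE_PATTERN)
  links.select { |uri| uri.path.to_s.match?(pattern) }
end

# Crawl a site, following only matching links and printing the title
# of each matching page. start_url is an assumption, e.g. a university site.
def crawl_courses(start_url)
  require "anemone" # loaded lazily so the helper above works without the gem

  Anemone.crawl(start_url) do |anemone|
    # Restrict which links get queued from each fetched page.
    anemone.focus_crawl { |page| links_to_follow(page.links) }

    # Run an action only on pages whose URL matches the pattern.
    anemone.on_pages_like(COURSE_PATTERN) do |page|
      next unless page.doc # page.doc is nil for non-HTML responses
      title = page.doc.at("title")
      puts "#{page.url}: #{title ? title.text.strip : '(no title)'}"
    end
  end
end
```

Calling crawl_courses("http://www.example.edu/") would start the crawl. If you already have Hpricot code, one option is to parse page.body with Hpricot inside the on_pages_like block instead of using page.doc.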