by boie0025
4163 days ago
I had to write scrapers in Ruby for a very large application that scraped all kinds of government information from various states. After a lot of pain working with very procedural scrapers, we found that a modified producer/consumer pattern worked well. The producers were classes that described each page to be scraped, with methods that matched the modeled data, and that made maintenance easy. We then created consumers that could be passed any of the page-specific producer classes and knew how to persist the scraped data. Once I had a good pattern in place, I could easily create subclasses of the data type I was trying to scrape, basically pointing each of the modeled-data methods to an XPath that was specific to that page.
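A rough sketch of what that could look like, not the original code. Everything named here is made up for illustration: `LicenseProducer`, the `name`/`number` fields, the two state page classes, the XPaths, the sample markup, and `MemoryConsumer`. It uses REXML so it runs without extra gems; a real scraper would more likely use a lenient HTML parser.

```ruby
require "rexml/document"

# Producer base class: one method per modeled field (hypothetical fields).
# Subclasses only supply the page-specific XPath for each field.
class LicenseProducer
  FIELDS = %i[name number].freeze

  def initialize(markup)
    @doc = REXML::Document.new(markup)
  end

  # Define #name and #number, each looking up its XPath in the subclass.
  FIELDS.each do |field|
    define_method(field) { text_at(self.class::XPATHS.fetch(field)) }
  end

  def to_h
    FIELDS.to_h { |field| [field, public_send(field)] }
  end

  private

  # Returns the stripped text of the first matching node, or nil if none.
  def text_at(xpath)
    node = REXML::XPath.first(@doc, xpath)
    node && node.text.to_s.strip
  end
end

# Page-specific producers (hypothetical pages and XPaths).
class OhioLicensePage < LicenseProducer
  XPATHS = { name: "//td[@class='holder']", number: "//td[@class='lic-no']" }.freeze
end

class TexasLicensePage < LicenseProducer
  XPATHS = { name: "//span[@id='licensee']", number: "//span[@id='license']" }.freeze
end

# Consumer: accepts any producer and persists its data.
# An in-memory array stands in for the real datastore here.
class MemoryConsumer
  attr_reader :records

  def initialize
    @records = []
  end

  def consume(producer)
    @records << producer.to_h
  end
end

# Sample well-formed markup standing in for fetched pages.
OHIO_MARKUP  = "<html><body><table><tr><td class='holder'> Jane Doe </td>" \
               "<td class='lic-no'>OH-123</td></tr></table></body></html>"
TEXAS_MARKUP = "<html><body><p><span id='licensee'>John Roe</span> " \
               "<span id='license'>TX-987</span></p></body></html>"

consumer = MemoryConsumer.new
consumer.consume(OhioLicensePage.new(OHIO_MARKUP))
consumer.consume(TexasLicensePage.new(TEXAS_MARKUP))
```

The consumer never needs to know which page it is handling, so supporting a new page in this sketch is just one more subclass with its own `XPATHS` table.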
We have a low-frequency discovery process that crawls the site to build a representative metadata structure. A high-frequency process then reads this structure to create the list of URLs to fetch and parse on each run.
The behaviour can then be modified, and/or the work divided between processes, by using command-line arguments that filter the metadata.
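To make the split concrete, here is a minimal Ruby sketch of the high-frequency side, assuming the discovery run writes JSON. The metadata layout (`sections`, `state`, `kind`, `base`, `pages`), the `--state`/`--kind` flags, and the example.test URLs are all invented for illustration and are not our actual format.

```ruby
require "json"
require "optparse"

# Hypothetical metadata, as the low-frequency discovery run might write it.
METADATA_JSON = <<~JSON
  {"sections": [
    {"state": "OH", "kind": "licenses", "base": "https://example.test/oh/licenses", "pages": 2},
    {"state": "OH", "kind": "permits",  "base": "https://example.test/oh/permits",  "pages": 1},
    {"state": "TX", "kind": "licenses", "base": "https://example.test/tx/licenses", "pages": 1}
  ]}
JSON

# Turn command-line arguments into a hash of metadata filters.
def parse_filters(argv)
  filters = {}
  OptionParser.new do |opts|
    opts.on("--state STATE") { |value| filters["state"] = value }
    opts.on("--kind KIND")   { |value| filters["kind"] = value }
  end.parse(argv)
  filters
end

# Keep only the sections matching every filter, then expand each
# section into one URL per page.
def urls_for(metadata, filters)
  metadata.fetch("sections")
          .select { |section| filters.all? { |key, value| section[key] == value } }
          .flat_map do |section|
            (1..section.fetch("pages")).map { |n| "#{section.fetch('base')}?page=#{n}" }
          end
end

metadata = JSON.parse(METADATA_JSON)

# One worker could take Ohio while another is started with --state TX.
ohio_urls = urls_for(metadata, parse_filters(["--state", "OH"]))
all_urls  = urls_for(metadata, parse_filters([]))
```

With no arguments the sketch returns every URL; each added flag narrows the list, which is one way to divide the same metadata between several processes.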