by boie0025 4163 days ago
I had to write scrapers in Ruby for a very large application that scraped all kinds of government information from various states. After a lot of pain working with very procedural scrapers, we found that a modified producer/consumer pattern worked well. The producers were classes that described each page to be scraped, with methods that matched the modeled data, which made maintenance easy. We then created consumers that could be passed any of the page-specific producer classes and knew how to persist the scraped data.

Once I had a good pattern in place I could easily create subclasses for each data type I was trying to scrape, basically pointing each of the modeled data methods to an XPath expression that was specific to that page.
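
To make the pattern concrete, here is a minimal sketch under stated assumptions, not the code from the application described above. The class names (PageProducer, StateALicensePage, StateBLicensePage, RecordConsumer), the `field` helper, and the sample HTML are all hypothetical. To keep the example standard-library only, each field is located with a regular expression; in a real scraper each one would more likely be an XPath query run through an HTML parser.

```ruby
# Base producer: subclasses declare where each modeled field lives on their page.
class PageProducer
  # Hypothetical helper: registers a field and defines a reader method for it.
  # `pattern` must contain exactly one capture group holding the value.
  def self.field(name, pattern)
    fields[name] = pattern
    define_method(name) do
      match = @html.match(pattern)
      match && match[1].strip
    end
  end

  # Per-subclass registry of declared fields.
  def self.fields
    @fields ||= {}
  end

  def initialize(html)
    @html = html
  end

  # Hash of every modeled field, ready to hand to a consumer.
  def to_record
    self.class.fields.keys.to_h { |name| [name, public_send(name)] }
  end
end

# Two pages that model the same data but lay it out differently.
class StateALicensePage < PageProducer
  field :holder, %r{<td class="name">(.*?)</td>}m
  field :number, %r{<td class="lic">(.*?)</td>}m
end

class StateBLicensePage < PageProducer
  field :holder, %r{<span id="holder">(.*?)</span>}m
  field :number, %r{<span id="license-no">(.*?)</span>}m
end

# Consumer: accepts any producer and knows how to persist its data.
# Persistence here is just an in-memory array standing in for a database.
class RecordConsumer
  attr_reader :store

  def initialize
    @store = []
  end

  def consume(producer)
    @store << producer.to_record
  end
end

consumer = RecordConsumer.new
consumer.consume(StateALicensePage.new('<td class="name"> Ada Smith </td><td class="lic">A-17</td>'))
consumer.consume(StateBLicensePage.new('<span id="holder">Bo Chen</span><span id="license-no">B-42</span>'))
```

Supporting a new page then means writing one small subclass; the consumer does not change.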

3 comments

I lead a team that works on several hundred bots scraping at high frequency. We also separate the problem of site structure and payload parsing, though slightly differently.

We have a low frequency discovery process that delves the site to create a representative meta-data structure. This is then read by a high frequency process to create a list of URLs to fetch and parse each time.

The behaviour can then be modified and/or work divided between processes by using command line arguments that cause filtering of the meta-data.
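
As a rough illustration of that split, the sketch below assumes the discovery process writes its meta-data as a JSON list and that the high frequency process filters it with command line flags. The JSON layout, the `--site` and `--section` flags, and the function names are assumptions made for this example, not details from the setup described above.

```ruby
require 'json'
require 'optparse'

# Hypothetical meta-data, as a slow discovery pass might have written it.
METADATA_JSON = <<~JSON
  [
    {"site": "example-a", "section": "prices", "url": "https://example.com/a/prices"},
    {"site": "example-a", "section": "news",   "url": "https://example.com/a/news"},
    {"site": "example-b", "section": "prices", "url": "https://example.com/b/prices"}
  ]
JSON

# Turn flags such as `--site example-a --section prices` into a filter hash.
def parse_filters(argv)
  filters = {}
  OptionParser.new do |opts|
    opts.on('--site NAME')    { |value| filters['site'] = value }
    opts.on('--section NAME') { |value| filters['section'] = value }
  end.parse(argv)
  filters
end

# Keep only the entries matching every supplied filter, then list their URLs.
# With no filters, every entry is kept.
def urls_to_fetch(metadata, filters)
  metadata
    .select { |entry| filters.all? { |key, value| entry[key] == value } }
    .map { |entry| entry['url'] }
end

metadata = JSON.parse(METADATA_JSON)
# A real process would pass ARGV here; a fixed list keeps the example repeatable.
filters = parse_filters(%w[--site example-a])
urls = urls_to_fetch(metadata, filters)
```

Running several copies of the fast process with different flags is one way to divide the URL list between them.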

I too run a crawler that visits a lot of pages, although not at a particularly high frequency. We visit hundreds of sites, and each site has a custom bot that essentially has two methods: find_links and extract. The first finds more links to visit on the site (e.g. navigates and follows pagination), whereas the latter finds and stores records. Is this similar to your approach?
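
A two-method bot of that shape could look roughly like the sketch below. Only the method names find_links and extract come from the comment above; the ExampleSiteBot class, the `crawl` loop, the regular expressions, and the in-memory pages are hypothetical, and the sketch returns records rather than storing them.

```ruby
require 'set'

# Hypothetical per-site bot with the two methods described above.
class ExampleSiteBot
  # Finds further links to visit, here only "next page" pagination links.
  def find_links(html)
    html.scan(/<a class="next" href="([^"]+)"/).flatten
  end

  # Finds the records on a page.
  def extract(html)
    html.scan(%r{<li class="record">([^<]+)</li>}).flatten.map(&:strip)
  end
end

# Generic crawl loop shared by all bots: breadth-first, never visiting a URL twice.
# `fetch` is any callable that maps a URL to its HTML.
def crawl(bot, start_url, fetch)
  queue = [start_url]
  seen = Set.new(queue)
  records = []
  until queue.empty?
    html = fetch.call(queue.shift)
    records.concat(bot.extract(html))
    bot.find_links(html).each do |link|
      # Set#add? returns nil when the link was already seen.
      queue << link if seen.add?(link)
    end
  end
  records
end

# In-memory pages stand in for real HTTP responses.
PAGES = {
  '/list?page=1' => '<li class="record">alpha</li><a class="next" href="/list?page=2">next</a>',
  '/list?page=2' => '<li class="record">beta</li><li class="record"> gamma </li>'
}.freeze

records = crawl(ExampleSiteBot.new, '/list?page=1', ->(url) { PAGES.fetch(url) })
```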

Incidentally, at scale I find that the hardest part is the whole orchestration: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.

The discovery process is crawling I suppose, but only within the same site. It is always assured that the higher speed process accesses data that we want to parse. It does no navigation.

Aside from having the physical capacity for the suite to run 24/7, our main challenge is speed. All data must be parsed, matched to other data in our database and published with the lowest possible latency.

We have pretty strict validation. Addressing errors in retrospect is preferable to publishing incorrect data.
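
One way to read that policy is as a gate in front of publishing: a record that fails any check is held back for later review instead of going out. The sketch below assumes that interpretation; the field names and the rules themselves are invented for illustration.

```ruby
# Hypothetical rules: every record needs an id and a positive numeric price.
REQUIRED_FIELDS = %i[id price].freeze

# Returns a list of human-readable problems; an empty list means the record is valid.
def validation_errors(record)
  errors = REQUIRED_FIELDS.reject { |field| record.key?(field) }
                          .map { |field| "missing #{field}" }
  price = record[:price]
  if record.key?(:price) && !(price.is_a?(Numeric) && price.positive?)
    errors << 'price must be a positive number'
  end
  errors
end

# Splits records into those safe to publish and those held back for review.
def partition_for_publishing(records)
  records.partition { |record| validation_errors(record).empty? }
end

publishable, held_back = partition_for_publishing([
  { id: 1, price: 9.5 },
  { id: 2, price: -1 },
  { price: 3 }
])
```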

If I understand you right, you have a lot of different data types to scrape, so essentially you have a sub-program for each data type and when a page is downloaded, you let each of these have a go at the page and emit content if it finds any? Or did I completely miss the point?
Yeah, I think we're on the same page. I just hacked together a quick example at this gist: https://gist.github.com/boie0025/ae9697eed61cbf5342a6
Thanks for the snippet - makes sense.
We do something very similar & I'd love to get in touch if you'd be interested in discussing further. My email is in my profile if you'd be willing to reach out.