Hacker News
by fallentimes 6647 days ago
We're using a general-purpose multi-threaded crawler to fetch the pages, then Beautiful Soup and a bit of regex to parse them. Although we scrape multiple sites, they are all in the same "category," so to speak, so there are a lot of generic parsing methods that are simply overridden when necessary. I played with PyParsing for a while, but since the data comes in so many slightly varied forms, I was ending up with rules that were miles long just to find a simple price or date/time in a way that would work across the largest possible number of sites.
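
To give a rough idea of the "generic methods, overridden when necessary" setup, here is a minimal sketch, not our actual code. The class names, the price patterns, and the "EUR" site are all made up for illustration, and a crude regex tag-stripper stands in for Beautiful Soup so the snippet runs with only the standard library.

```python
import re
from typing import Optional


class GenericParser:
    """Default parsing rules shared by every site in the category (hypothetical)."""

    # Assumed default format: "$19.99", "$1,299.50", "$40".
    PRICE_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")

    def text(self, html: str) -> str:
        # Crude stand-in for Beautiful Soup: drop tags, collapse whitespace.
        no_tags = re.sub(r"<[^>]+>", " ", html)
        return re.sub(r"\s+", " ", no_tags).strip()

    def price(self, html: str) -> Optional[float]:
        # Return the first price found in the page text, or None.
        match = self.PRICE_RE.search(self.text(html))
        if match is None:
            return None
        return float(match.group(1).replace(",", ""))


class EuroSiteParser(GenericParser):
    """Hypothetical site that writes prices as "19,99 EUR"."""

    PRICE_RE = re.compile(r"(\d+),(\d{2})\s*EUR")

    def price(self, html: str) -> Optional[float]:
        # Only the price rule is overridden; text() is inherited unchanged.
        match = self.PRICE_RE.search(self.text(html))
        if match is None:
            return None
        return float(f"{match.group(1)}.{match.group(2)}")


if __name__ == "__main__":
    page = '<div class="item"><b>Price:</b> <span>$1,299.50</span></div>'
    print(GenericParser().price(page))
    print(EuroSiteParser().price("<p>Preis: 19,99 EUR</p>"))
```

The point is that each new site only needs a small subclass for whatever it does differently, instead of one giant grammar that tries to cover every variation up front.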