by inovica 5657 days ago
A good read, and very timely from my perspective. We created a crawler in Python a couple of years ago for RSS feeds, but we ran into a number of issues with it, so we put it on hold while we concentrated on work that made money :) We picked the project up again last week, and we've been weighing rolling our own against using a framework like Scrapy. The main thing for us is being able to scale. If anyone has experience building a distributed crawler in Python, I'd welcome some advice.

Thanks again. Really good post.

2 comments

After having written the thesis and thought about that stuff for another few weeks, my summary would be:

- Use asynchronous I/O to maximize single-node speed (Twisted should be a good choice for Python). It might feel strange in the beginning, but it usually pays off, especially with languages that aren't good at threading (Ruby, Python, ...).

- Redis is awesome! Fast, functional, beautiful :)

- Riak seems to be a great distributed datastore if you really have to scale over multiple nodes.

- Solr or Sphinx are just better optimized than most datastores when it comes to full-text search.

- Take a day to look at graph databases (I'm still not 100% sure if I could have used one for my use cases)
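
To make the asynchronous I/O point concrete, here is a minimal sketch of a single-threaded crawl loop with a cap on in-flight requests. It uses the standard library's `asyncio` rather than Twisted, purely so that it runs without extra installs; the idea is the same. The names `crawl`, `fake_fetch` and `limit` are made up for illustration, and `fake_fetch` only simulates a download, so a real async HTTP client would have to be swapped in.

```python
import asyncio


async def crawl(urls, fetch, limit=100):
    """Fetch every URL concurrently on one thread, at most `limit` at a time.

    `fetch` is any coroutine function that takes a URL and returns its body.
    It is a placeholder for a real asynchronous HTTP client.
    """
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url):
        async with semaphore:  # cap the number of in-flight requests
            try:
                return url, await fetch(url)
            except Exception as exc:  # one bad feed should not stop the crawl
                return url, exc

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a network call so the sketch runs offline."""
    await asyncio.sleep(0.01)  # yields to the event loop like real I/O would
    return f"<rss>{url}</rss>"


if __name__ == "__main__":
    feeds = [f"http://example.com/feed/{i}" for i in range(5)]  # hypothetical URLs
    for url, body in asyncio.run(crawl(feeds, fake_fetch, limit=2)):
        print(url, len(body))
```

While one request waits on the network, the event loop moves on to the others, which is why a single process can keep many downloads going without threads.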

Thanks for the tips! I really appreciate it. I'll check these out. All getting very exciting for my Christmas project!

If you are just doing RSS feeds, I would say do it yourself. Armed with Feedparser (http://feedparser.org/) you can implement what you want pretty quickly.

For both http://www.searchforphp.com/ and http://www.searchforpython.com/ I wrote my own RSS reader. To make it scale out, I just used Python's multiprocessing to parcel the work out to 50 or so concurrent downloads. I can tear through thousands of feeds pretty quickly that way. The next step, to multiple machines, is just to throw in a queue system and get the list of feeds from it.

Pretty simple stuff really.
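
A rough sketch of the multiprocessing approach described above, under a few assumptions: it parses with the standard library's `xml.etree.ElementTree` instead of Feedparser (so it only handles plain RSS 2.0 `<item>` elements), and the function names `extract_titles`, `fetch_titles` and `crawl` are invented for this example, not taken from the sites mentioned.

```python
import multiprocessing
import urllib.request
import xml.etree.ElementTree as ET


def extract_titles(xml_bytes):
    """Return the <title> text of every RSS 2.0 <item> (stand-in for Feedparser)."""
    root = ET.fromstring(xml_bytes)
    return [item.findtext("title", default="") for item in root.iter("item")]


def fetch_titles(url):
    """Worker: download and parse one feed.

    Returns (url, titles, error); error is None on success. Errors come back
    as strings so the result can always be pickled to the parent process.
    """
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, extract_titles(response.read()), None
    except Exception as exc:
        return url, [], str(exc)


def crawl(urls, workers=50):
    """Fan the URLs out to a pool of worker processes (50, as in the comment)."""
    with multiprocessing.Pool(processes=workers) as pool:
        # imap_unordered yields each result as soon as its worker finishes
        return list(pool.imap_unordered(fetch_titles, urls))
```

Call `crawl(feed_urls)` from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn rather than fork. To move to several machines, `feed_urls` would come from a shared queue instead of a local list.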