Hacker News
by arkitaip 5655 days ago
Very timely and interesting. I am currently looking for a crawler that is tightly integrated with Drupal and can be easily managed through Drupal nodes. Any suggestions for a solution for a small site that only needs to handle thousands of pages/URLs?
2 comments

I don't really know what "managed through Drupal nodes" means in this context. For a simple Drupal full-text search I can recommend Apache Solr ( http://drupal.org/project/apachesolr ).

For regular crawling:

I found Anemone ( http://anemone.rubyforge.org/ ) to be a lovely framework for single-page crawls.

Other interesting candidates:

https://github.com/hasmanydevelopers/RDaneel

http://www.redaelli.org/matteo-blog/projects/ebot/

http://nutch.apache.org/ (meh, Java)
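
For a site with only thousands of pages, the core of a crawl can also be sketched with nothing but the Python standard library. This is not how Anemone or any of the projects above work internally; it is just a rough illustration of a same-host, breadth-first crawl. The names `LinkParser`, `fetch`, `crawl`, and the `max_pages` cap are made up for this example, and the sketch ignores robots.txt, rate limiting, and non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=1000):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each successfully fetched URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://example.com/")` (a placeholder URL) would return the fetched pages keyed by URL. Passing your own `fetch_page` function lets you add delays or caching without touching the crawl loop.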

Scrapy (http://scrapy.org/) is a well-documented, open-source Python scraping framework that I've used in a couple of projects.
Indeed, seems like a great framework.

Given the project's timeframe, I had to rely on something I'm pretty OK at (Ruby), but I remember running into a lot of posts about Scrapy along the way.