| HN Mirror



	by arkitaip 5703 days ago
	Very timely and interesting. I am currently looking for a crawler that is tightly integrated with Drupal and can be easily managed through Drupal nodes. Any suggestions for a solution for a small site that only needs to handle thousands of pages/URLs?

2 comments

rb2k_ 5703 days ago

I don't really know what "managed through Drupal nodes" means in this context. For a simple Drupal full-text search I can recommend Apache Solr (http://drupal.org/project/apachesolr).

For regular crawling:

I found Anemone (http://anemone.rubyforge.org/) to be a lovely framework for single-page crawls.

Other interesting candidates:

https://github.com/hasmanydevelopers/RDaneel

http://www.redaelli.org/matteo-blog/projects/ebot/

http://nutch.apache.org/ (meh, java)


toumhi 5703 days ago

Scrapy (http://scrapy.org/) is a well-documented, open-source Python scraping framework that I've used in a couple of projects.


rb2k_ 5703 days ago

Indeed, seems like a great framework.

Considering the timespan of the project, I had to rely on something I'm pretty OK at (Ruby), but I remember running into a lot of posts about Scrapy along the way.
