by crdoconnor 4379 days ago
I don't recommend scrapy. Classic example of a framework that should have been a library. It will work up until a point and then it will railroad your app and you will have a really painful time breaking out of the 'scrapy' way of doing things. Classic 'framework' problem.

I prefer a combination of celery (distributed task management), mechanize (pretend web browser) and pyquery (jquery selectors for python).

2 comments

Agreed. I used BeautifulSoup in combination with Celery.

To me scraping is such a specific thing it's best to write your own 'framework'.

I'm not sure how you would design a library for event-loop-based website navigation where the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think Twisted web services, etc).

You can parallelize synchronous mechanize/requests scripts via celery, but that uses resources less efficiently when the bottleneck is I/O, and it has larger fixed costs per task.
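To illustrate the I/O point, here is a minimal sketch of many requests waiting concurrently on one event loop in a single process. It uses the standard-library asyncio rather than Twisted or Scrapy, and `fetch`, `crawl`, the example.com URLs and the sleep-based "request" are all made up for illustration; no real network traffic happens.

```python
import asyncio

async def fetch(url, delay, stats):
    # Hypothetical stand-in for a network request: asyncio.sleep hands control
    # back to the event loop, the way a non-blocking socket read would.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(delay)
    stats["in_flight"] -= 1
    return url, len(url)

async def crawl(urls, delay=0.01):
    # Start every request up front; they all wait on the same loop and thread.
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(u, delay, stats) for u in urls))
    return results, stats["peak"]

URLS = ["http://example.com/page/%d" % i for i in range(20)]
results, peak = asyncio.run(crawl(URLS))
print(len(results), "pages fetched; peak concurrent requests:", peak)
```

A pool of synchronous workers would need one worker per request to reach the same overlap, which is where the per-task fixed costs come from.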

Running N Scrapy processes, each handling 1/N of the URLs, is an easy enough way to distribute load; if that is not enough, a shared queue like https://github.com/darkrho/scrapy-redis is also an option.
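The 1/N split can be sketched roughly as below, assuming every process sees the same URL list and knows its own index. `shard_of` and `urls_for_worker` are hypothetical helpers, not part of Scrapy or scrapy-redis; hashing the URL is just one way to get a stable split without any coordination between processes.

```python
import hashlib

def shard_of(url, n):
    # Stable hash (unlike the built-in hash(), which is randomized per process),
    # so every process assigns a given URL to the same shard.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def urls_for_worker(urls, worker_index, n):
    # Each of the n processes keeps only the URLs that hash to its own index.
    return [u for u in urls if shard_of(u, n) == worker_index]

ALL_URLS = ["http://example.com/item/%d" % i for i in range(100)]
N = 4
shards = [urls_for_worker(ALL_URLS, i, N) for i in range(N)]
print([len(s) for s in shards])
```

Each process would then be started with its own index and crawl only its shard. The split is only approximately even, and URLs discovered mid-crawl need the same check before being followed.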

I think it is not the "scrapy" way of doing things that causes the problems; it is the inherent complexity of concurrency. You either give up some concurrency or build your solution around it.