

by kmike84 4426 days ago

I'm not sure how you would design a library for event-loop-based website navigation when the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think Twisted web services, etc.).

You can parallelize synchronous mechanize/requests scripts via Celery, but that is less efficient in terms of resource usage when the bottleneck is I/O; it also has a higher fixed cost per task.
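To illustrate the I/O-bound point, here is a rough sketch of many waits overlapping on a single event loop. It uses the standard library's asyncio rather than Twisted or Scrapy, and `fake_fetch` is a made-up stand-in that only sleeps instead of doing a real HTTP request, so treat it as an illustration rather than a benchmark.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    # Hypothetical stand-in for an HTTP request: it only waits, like an
    # I/O-bound call would, and returns a pretend status code.
    await asyncio.sleep(delay)
    return url, 200


async def crawl(urls):
    # All waits overlap on one event loop, in one process and one thread.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_crawl(urls):
    # Run the crawl and report how long the whole batch took.
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(20)]
    results, elapsed = timed_crawl(urls)
    print(len(results), "responses in", round(elapsed, 2), "seconds")
```

Twenty simulated requests finish in roughly the time of one, with no worker processes or per-task dispatch involved.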

Running N Scrapy processes, each handling 1/N of the total URLs, is an easy enough way to distribute load; if that is not enough, a shared queue like https://github.com/darkrho/scrapy-redis is also an option.
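One way the 1/N split could look, as a minimal sketch: each process keeps only the URLs whose hash falls into its own shard. The names `shard_of` and `urls_for_shard` are hypothetical and not part of Scrapy; how the shard index reaches each process is left open. A cryptographic hash is used instead of the built-in `hash()` because the latter is randomized per process for strings.

```python
import hashlib


def shard_of(url, n):
    # Map a URL to a shard index in [0, n). sha1 gives the same answer in
    # every process, so all N workers agree on who owns which URL.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def urls_for_shard(urls, shard, n):
    # The subset of URLs that process number `shard` (0-based) should crawl.
    return [u for u in urls if shard_of(u, n) == shard]


if __name__ == "__main__":
    all_urls = ["http://example.com/item/%d" % i for i in range(10)]
    n = 3
    for shard in range(n):
        print(shard, urls_for_shard(all_urls, shard, n))
```

The split is only even on average, and it is static: URLs discovered during the crawl are not rebalanced, which is where a shared queue becomes the better fit.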

I don't think it is the "Scrapy way" of doing things that causes the problems; it is the inherent complexity of concurrency. You either give up some concurrency or build your solution around it.