I can recommend scrapy[0] if you're working on a somewhat bigger problem. But even for a small one, if you're familiar with scrapy, it's incredibly fast to write a simple scraper with your data neatly exported to .json.
I don't recommend scrapy. Classic example of a framework that should have been a library. It will work up until a point and then it will railroad your app and you will have a really painful time breaking out of the 'scrapy' way of doing things. Classic 'framework' problem.
I prefer a combination of celery (distributed task management), mechanize (pretend web browser) and pyquery (jquery selectors for python).
I'm not sure how you would design a library for event-loop based website navigation when the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think twisted web services, etc).
You can parallelize synchronous mechanize/requests scripts via celery, but that is less efficient in terms of resource usage if the bottleneck is I/O; it also has larger fixed costs per task.
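To make the I/O point concrete, here is a rough sketch of many "requests" sharing one event loop. It uses asyncio from the Python 3 standard library as a stand-in for Twisted (Scrapy itself is built on Twisted, not on this code), and fake_fetch, crawl and run_demo are hypothetical names where the network call is simulated with a sleep. Treat it as an illustration under those assumptions, not as how Scrapy is implemented.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    # Hypothetical stand-in for a network request: while this coroutine
    # sleeps, the event loop is free to run the other pending fetches.
    await asyncio.sleep(delay)
    return url, len(url)


async def crawl(urls, limit=10):
    # Cap the number of in-flight "requests", as a crawler would.
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fake_fetch(url)

    # gather() returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo(n=20):
    urls = ['http://example.com/page/%d' % i for i in range(n)]
    started = time.monotonic()
    results = asyncio.run(crawl(urls, limit=n))
    elapsed = time.monotonic() - started
    return results, elapsed
```

With limit equal to the number of urls, all the simulated waits overlap in a single process and a single thread, so the total time should be close to one delay rather than n of them. A celery setup would need a worker (and a task message) per concurrent request to get the same overlap.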
N Scrapy processes, each processing 1/N of total urls is an easy enough way to distribute load; if that is not enough then a shared queue like https://github.com/darkrho/scrapy-redis is also an option.
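A minimal sketch of the "1/N of the urls per process" split, in plain Python 3 with no Scrapy involved. The function names shard_for and urls_for_worker are made up for illustration, and using an md5 digest is just one possible choice; the point is that every process must agree on which shard a url belongs to without talking to the others.

```python
import hashlib


def shard_for(url, n_workers):
    # Use a stable digest rather than the built-in hash(): string hashing
    # is randomized per process, so separate processes would disagree.
    digest = hashlib.md5(url.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_workers


def urls_for_worker(urls, worker_index, n_workers):
    # Each process keeps only the urls that fall into its own shard.
    return [u for u in urls if shard_for(u, n_workers) == worker_index]
```

Each process would be started with its own worker_index (for example passed on the command line) and would feed only its share of the list to its spider. The split is static, so it does not rebalance if one shard turns out slower; that is where a shared queue becomes the better option.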
I think it is not the "scrapy" way of doing things that causes the problems, but the inherent complexity of concurrency; you either give up some concurrency or build your solution around it.
    # It requires scrapy from github.
    # Save it to tickets.py and execute
    # "scrapy runspider tickets.py" from the command line
    from urlparse import urljoin
    import scrapy

    class TicketSpider(scrapy.Spider):
        name = 'tickets'
        start_urls = ['http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets']

        def parse(self, response):
            for listing in response.css('p.row'):
                price_txt = listing.css('span.price').re(r'(\d+)')
                if not price_txt:
                    continue
                price = int(price_txt[0])
                if 100 < price <= 250:
                    url = urljoin(response.url, listing.css('a::attr(href)').extract()[0])
                    print ' '.join(listing.css('::text').extract())
                    print url
                    print
There is no reason to prefer Scrapy for extracting information from a single webpage, but on the other hand it is not any harder than BS+pyquery+requests.