by tst 4379 days ago
I can recommend scrapy[0] if you are working on a somewhat bigger problem. But even then, if you are familiar with scrapy, it's incredibly fast to write a simple scraper with your data neatly exported to .json.

[0]: http://scrapy.org/

2 comments

I don't recommend scrapy. It's a classic example of a framework that should have been a library. It will work up to a point, and then it will railroad your app and you will have a really painful time breaking out of the 'scrapy' way of doing things. Classic 'framework' problem.

I prefer a combination of celery (distributed task management), mechanize (pretend web browser) and pyquery (jquery selectors for python).

Agreed. I used BeautifulSoup in combination with Celery.

To me scraping is such a specific thing it's best to write your own 'framework'.

I'm not sure how you would design a library for event-loop based website navigation when the event loop is explicit. Scrapy (which is a wrapper over Twisted) is already quite close to this IMHO. You can plug anything into the same event loop if needed (think Twisted web services, etc.).

You can parallelize synchronous mechanize/requests scripts via celery, but that is less efficient in terms of resource usage if the bottleneck is I/O; it also has larger fixed costs per task.
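
To illustrate the I/O-bound point, here is a minimal sketch using the standard library's asyncio rather than Twisted or Scrapy (that substitution is mine, as are the names `fetch`, `crawl` and `run_demo`). The network request is simulated with `asyncio.sleep`, so the timing only shows the shape of the argument, not real crawl performance:

```python
import asyncio
import time


async def fetch(url, delay=0.1):
    # Stand-in for a network request: asyncio.sleep simulates waiting on I/O.
    await asyncio.sleep(delay)
    return url


async def crawl(urls):
    # All simulated requests wait concurrently on one event loop,
    # inside a single process and a single thread.
    return await asyncio.gather(*(fetch(u) for u in urls))


def run_demo(count=20):
    # Hypothetical URLs, used only as labels for the simulated requests.
    urls = ["http://example.com/page/%d" % i for i in range(count)]
    started = time.perf_counter()
    results = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - started
    return urls, results, elapsed


if __name__ == "__main__":
    urls, results, elapsed = run_demo()
    print("%d pages in %.2fs" % (len(results), elapsed))
```

Twenty simulated requests of 0.1s each would take about 2s one after another; on one event loop they overlap, and no worker process or task message is needed per request.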

Running N Scrapy processes, each handling 1/N of the total URLs, is an easy enough way to distribute load; if that is not enough, a shared queue like https://github.com/darkrho/scrapy-redis is also an option.
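
One way the 1/N split could look, as a rough sketch: each process keeps only the URLs that hash to its own index. The helper names `shard_for` and `urls_for_worker` are made up for illustration and are not part of Scrapy; how each process learns its index (a command-line argument, an environment variable) is left open:

```python
import zlib


def shard_for(url, n):
    # crc32 gives the same value in every process and on every run,
    # unlike the built-in hash() for strings, which is randomized per process.
    return zlib.crc32(url.encode("utf-8")) % n


def urls_for_worker(urls, worker, n):
    # Keep only the URLs assigned to this worker (0 <= worker < n).
    return [u for u in urls if shard_for(u, n) == worker]


if __name__ == "__main__":
    # Hypothetical URL list, for illustration only.
    all_urls = ["http://example.com/item/%d" % i for i in range(10)]
    for worker in range(3):
        print(worker, urls_for_worker(all_urls, worker, 3))
```

Each worker's list could then serve as that process's `start_urls`. The split is only roughly even, and links discovered during the crawl would need the same check before being followed.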

I think it is not the "scrapy" way of doing things that causes the problems; it is the inherent complexity of concurrency. You either give up some concurrency or build your solution around it.

A Scrapy spider that does exactly the same thing::

    # It requires scrapy from github.
    # Save it to tickets.py and execute 
    # "scrapy runspider tickets.py" from the command line

    from urllib.parse import urljoin
    import scrapy

    class TicketSpider(scrapy.Spider):
        name = 'tickets'
        start_urls = ['http://philadelphia.craigslist.org/search/sss?sort=date&query=firefly%20tickets']

        def parse(self, response):
            for listing in response.css('p.row'):
                # Skip listings that show no price.
                price_txt = listing.css('span.price').re(r'(\d+)')
                if not price_txt:
                    continue
                price = int(price_txt[0])
                if 100 < price <= 250:
                    # Listing links may be relative, so resolve them against the page URL.
                    url = urljoin(response.url, listing.css('a::attr(href)').extract()[0])
                    print(' '.join(listing.css('::text').extract()))
                    print(url)
                    print()

There is no reason to prefer Scrapy for extracting information from a single webpage, but on the other hand it is no harder than BS+pyquery+requests.