by Bedon292 3402 days ago
Wouldn't it be more appropriate to use something like aiohttp? https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22... With no docker, or anything like that.

I know the benchmarks cannot really be compared, since one involves a queue, and mongo, while the other does not. But, it seems like a prime use case for async.
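
To make the async idea concrete, here is a minimal sketch using only the standard library. It is not the article's code: `crawl`, `fake_fetch` and the `limit` of 40 are hypothetical names and values, and `fake_fetch` only sleeps in place of a real HTTP request (with aiohttp you would pass in your own coroutine that does the request).

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request; it only sleeps briefly.
    await asyncio.sleep(0.01)
    return "hello world"


async def crawl(urls, fetch=fake_fetch, limit=40):
    # Cap the number of requests in flight, much like a pool size.
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather returns results in the same order as the input urls.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = 1000 * ["http://localhost/hello"]
    results = asyncio.run(crawl(urls))
    print(len(results))
```

Everything runs in one process and one thread, so the memory cost per in-flight request is small compared to a process per worker.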

2 comments

You could also use multiprocessing. I got ~500req/s against a server returning a 'hello world' response (which is what the article also does). The article does about 300req/s, but that's because he saturates his pipe. In reality, the article's setup might be faster than 1,000,000/hour.

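    from multiprocessing import Pool
    from requests import get

    urls = 1000 * ['http://localhost/hello']

    def scrape(url):
        # Fetch one page and return its body as text.
        return get(url).text

    if __name__ == '__main__':
        # 40 worker processes; the guard is needed on platforms that spawn workers.
        with Pool(40) as p:
            results = p.map(scrape, urls)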
~2.2 seconds on a dual-core 2.2GHz machine
Thanks for your comment. If you run the same test against a cloud server or a public website, the throughput will probably drop somewhat.

I've used multiprocessing/threads/gevent/asyncio before, and I will run a full test with these libraries.

Thanks again!

As I said, the benchmark is flawed since it's dependent on the network pipe. It would be a good idea to run tests locally so you get a real maximum.

There are lots of factors involved which can completely skew benchmarks. For example, if you were scraping an average 10kb response instead of 'hello world', you would automatically be limited to roughly 100req/s on a 10mbit pipe.
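
A rough back-of-the-envelope check of that limit, assuming "10kb" means 10 kilobytes, decimal units, and no protocol overhead (the function name is made up for illustration):

```python
def max_requests_per_second(link_mbit, response_kb):
    """Upper bound on req/s when the link is the bottleneck (overhead ignored)."""
    link_bits = link_mbit * 1_000_000       # megabits per second -> bits per second
    response_bits = response_kb * 1000 * 8  # kilobytes -> bits
    return link_bits / response_bits


print(max_requests_per_second(10, 10))  # 125.0
```

So the theoretical ceiling is about 125req/s, and headers plus TCP overhead pull the practical figure down toward the ~100req/s mentioned above.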

Thanks for your comment. I've used multiprocessing/threads/gevent/asyncio before, and I will run a full test with these libraries and tools. This post is just a quick demo of building a distributed crawler with docker. Asyncio and aiohttp are a great combination for this case: faster and using less memory. But aiohttp only supports HTTP proxies, which is perhaps the one place where it is not so perfect.

Thanks again for your comment. You are welcome to discuss more technical details about it.