I know the benchmarks cannot really be compared, since one involves a queue and Mongo while the other does not. Still, it seems like a prime use case for async.
You could also use multiprocessing: I got about 500req/s against a server returning a 'hello world' response (which the article also does). The article does about 300req/s, but that's because he saturates his pipe. In reality the article's setup might be faster than 1,000,000/hour.
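from multiprocessing import Pool
from requests import get

urls = 1000 * ['http://localhost/hello']

def scrape(url):
    # Fetch one URL and return the response body as text.
    return get(url).text

if __name__ == '__main__':
    # 40 worker processes; the guard is needed where workers are spawned.
    with Pool(40) as p:
        results = p.map(scrape, urls)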
As I said, the benchmark is flawed since it's dependent on the network pipe. It would be a good idea to run tests locally so you get a real maximum.
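To make the "run it locally" idea concrete, here is a rough sketch that uses only the standard library: it starts a throwaway 'hello world' server on localhost and times a batch of requests against it. The names HelloHandler, BenchServer, fetch and measure are made up for this example, and the request count and worker count are arbitrary assumptions, not values from the article. Since client and server share one machine (and one Python process), treat the number as a ballpark, not a real maximum for any particular crawler.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET with a tiny 'hello world' body."""

    def do_GET(self):
        body = b"hello world"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging so it does not skew the timing


class BenchServer(ThreadingHTTPServer):
    # A larger listen backlog so a burst of connections is not delayed.
    request_queue_size = 128


def fetch(url):
    # One blocking GET; returns the decoded response body.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode()


def measure(n_requests=200, workers=20):
    # Port 0 lets the OS pick a free port, so nothing is hard-coded.
    server = BenchServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/hello"
    try:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(fetch, [url] * n_requests))
        elapsed = time.perf_counter() - start
    finally:
        server.shutdown()
        server.server_close()
    return results, n_requests / elapsed


if __name__ == "__main__":
    results, rate = measure()
    print(f"{len(results)} responses, {rate:.0f} req/s")
```

Swapping fetch for the scraper you actually want to test keeps the network pipe out of the measurement.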
There are lots of factors involved which can completely skew benchmarks. For example, if you were scraping an average 10kb response instead of 'hello world', you would automatically be limited to roughly 100req/s on a 10mbit pipe.
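The arithmetic behind that limit can be sketched in a few lines. This assumes "10kb" means 10 kilobytes, uses decimal units, and ignores HTTP headers and TCP overhead, so it gives an upper bound; the function name is made up for the example.

```python
def max_requests_per_second(link_mbit_per_s, response_kilobytes):
    """Upper bound on req/s when the link is the only bottleneck."""
    link_bits_per_s = link_mbit_per_s * 1_000_000
    # Assumption: 1 kilobyte = 1000 bytes, 8 bits per byte, no overhead.
    response_bits = response_kilobytes * 1000 * 8
    return link_bits_per_s / response_bits


if __name__ == "__main__":
    # 10 Mbit/s link, 10 kB responses
    print(max_requests_per_second(10, 10))  # 125.0
```

So the theoretical ceiling is 125req/s, and protocol overhead pulls the practical figure down towards 100req/s.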
Thanks for your comment. I've used multiprocessing, threads, gevent and asyncio before, and I will run a full test with these libraries and tools.
This post is just a quick demo of building a distributed crawler with Docker. Asyncio and aiohttp are a great combination for this case: they use less memory and run faster. But aiohttp only supports HTTP proxies, which is perhaps the one place where it falls short.
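For reference, a minimal sketch of the asyncio pattern, using only the standard library. fake_fetch is a hypothetical stand-in that just sleeps; in a real crawler it would be replaced by an actual HTTP request (for example through aiohttp). The limit of 40 is an assumption borrowed from the Pool(40) in the snippet above.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request: it only yields to
    # the event loop for a moment and returns some text.
    await asyncio.sleep(0.01)
    return f"hello world from {url}"


async def crawl(urls, fetch=fake_fetch, limit=40):
    # The semaphore caps how many fetches are in flight at once.
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = 1000 * ["http://localhost/hello"]
    results = asyncio.run(crawl(urls))
    print(len(results))
```

Everything runs in one process and one thread, which is where the memory saving over a pool of 40 processes would come from.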
Thanks again for your comment. You're welcome to discuss more technical details.