| HN Mirror

	by wopwopwop 3402 days ago
	Because crawlers spend most of their time waiting on the network, you might as well use Python's gevent.

wumpus 3402 days ago

Python 3's asyncio and aiohttp do the job for me -- I can crawl several times as fast as the article's 111 qps with just a single process. https://github.com/cocrawler/cocrawler

Of course, it matters what you're doing with the page content, and how you're managing your metadata, and all that.
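
As a rough illustration of why a single asyncio process can keep many requests in flight, here is a minimal sketch using only the standard library. It is not taken from cocrawler: `fake_fetch`, `crawl`, `run_demo`, the `example.invalid` URLs, and the 50 ms delay are all hypothetical stand-ins, and the sleep only simulates network wait. In a real crawler the `fetch` argument would presumably wrap an aiohttp request instead.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    # Hypothetical stand-in for a real HTTP request; the sleep models network wait.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, fetch=fake_fetch, max_concurrency=100):
    # Cap in-flight requests so the crawler does not open unbounded connections.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return url, await fetch(url)

    # All fetches are scheduled on one event loop; while one waits, others run.
    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


def run_demo(n=200):
    # Made-up URLs; nothing is actually requested over the network.
    urls = [f"http://example.invalid/page/{i}" for i in range(n)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - start
    return pages, elapsed


if __name__ == "__main__":
    pages, elapsed = run_demo()
    print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Fetched one after another, 200 simulated pages at 50 ms each would take about 10 seconds; with up to 100 in flight the waits overlap, so the total is far shorter. Real throughput still depends on parsing cost and metadata handling, as noted above.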

plantpark 3402 days ago

Thanks for your comment.

You are right, asyncio and aiohttp are a great combination. I've used them both before. Although aiohttp uses less memory than requests, it doesn't support HTTPS proxies. That is the only thing about it that isn't perfect.

Besides, asyncio is somewhat similar to Celery's concurrency. I will run a test to see which one performs better.

Thanks again for your comment.