Not too long ago I built a small webcrawler using Node.js, figuring that crawlers spend most of their time waiting (e.g. downloading) and therefore Node.js would be well suited. At the time I found crawlers written in Python were fairly slow, which is not a surprise. Mine is backed by Redis and is pretty fast even on a single process. https://github.com/brendonboshell/supercrawler
Javascript is the language to build a smaller crawler in because web pages run Javascript and your crawler needs to as well. A C++ or Golang crawler will be a little faster and use a lot less memory, but you have to compile in webkit and do a bunch of hacky stuff to run the page's Javascript.
On the other hand, if you are building a massive crawler you want to split the crawling and parsing into two separate functions and do the network I/O in golang/c and do the parsing with a Javascript headless browser like phantom.
I don't really see any reason to use python unless you haven't learned golang (the world's biggest crawler's own language).
I've just run a small test (crawling a server running locally) and it comes out at 243 pages per second with one process. This crawls a webpage, adds its links to the queue and saves the URL in a Redis set. This is running on a Macbook Pro.
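The loop described above (crawl a page, queue its links, record the URL in a set) can be sketched in Python as follows. This is only an illustration under stated assumptions, not supercrawler's code: `crawl`, `LinkParser` and the injected `fetch` callable are hypothetical names, and a plain in-memory set and deque stand in for the Redis set and the queue.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch` is any callable mapping URL -> HTML text."""
    seen = {start_url}          # stand-in for the Redis set of known URLs
    queue = deque([start_url])  # stand-in for the crawl queue
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        crawled.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client, so the same logic can be exercised against a dictionary of canned pages.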
The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth rather than network latency - in which case that is really not a limiting factor when running in a datacenter (40 Gbps in at Linode for example).
Some Python libraries, such as asyncio or gevent, can do this work asynchronously and efficiently. I will test these libraries later. In the meantime, you are welcome to post more details about asynchronous I/O in Node.js. Thanks again for your comment!
Celery will typically use lightweight "threads" like greenlets or eventlets. I don't think it uses multiple processes insofar as we're talking about python where process == core.
You can do greenlets or prefork concurrency. With prefork concurrency you get 1 process per fork. Gives you concurrency at the expense of a bit of memory.
I was hoping to compare resource utilization and performance between 40 Docker instances each with 20 connections vs. 1 process.
It's not even clear whether or not the author actually hit any external websites: "In order to have a quick test, I just build a nginx hello page in my cloud server. Then scale it up to a list of 1000000."
It's just a tutorial about how to use docker and celery to build a distributed system. I will run more tests on the performance of multiprocessing, threads, other concurrency models, and libraries that support these techniques. If you don't know how to build a test web server, or don't want to, a big site like example.com could be your choice. BUT please be gentle with these public sites.
Your bottleneck should probably be managing your requests, not saving the document/assets. I'm skeptical that language speed is a significant factor in scraping. Rate limiting and being smart about how you're fetching data should probably be your concern. ... sure, if you don't mind slamming a server with concurrent requests, your language choice might start to matter if your IP isn't blocked first.
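As a rough illustration of the rate limiting mentioned above, here is a minimal per-host limiter. The class name `HostRateLimiter`, the `min_delay` parameter and the injectable `clock` are assumptions made for this sketch; a real crawler would likely also honor robots.txt and back off on errors.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Reserve a slot for `url` and return how long to sleep first."""
        host = urlsplit(url).netloc
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now
```

A caller would run `time.sleep(limiter.wait_time(url))` before each request. Requests to different hosts do not delay one another, so overall throughput stays high while each individual server sees a gentle request rate.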
Python 3's asyncio and aiohttp do the job for me -- I can crawl several times as fast as the article's 111 qps with just a single process. https://github.com/cocrawler/cocrawler
Of course, it matters what you're doing with the page content, and how you're managing your metadata, and all that.
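For readers unfamiliar with the single-process asyncio approach, a minimal sketch follows. It is not cocrawler's code: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a request (a real crawler would call an HTTP client such as aiohttp there), and `crawl_all` and `limit` are names chosen for illustration.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP request (e.g. one made with aiohttp)."""
    await asyncio.sleep(0.01)  # simulates time spent waiting on the network
    return f"body of {url}"


async def crawl_all(urls, fetch=fake_fetch, limit=20):
    """Fetch every URL in one process, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Running `asyncio.run(crawl_all(urls))` drives all requests from one event loop. The semaphore caps concurrency, so the target server is not hit with every request at once.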
You are right, asyncio and aiohttp are a great combination. I've used them both before. Although aiohttp uses less memory than requests, it doesn't support HTTPS proxies. That is its only real shortcoming.
Besides, asyncio is much the same as Celery's concurrency. I will test which one performs better.
Did I read that right? "it's necessary to deploy docker clusters to maximize performance of your machine" to get the performance out of a single system?
I had the same initial thought too - but he said he's creating a crawler that is meant to be distributed - in which case it's fine to use Docker since it makes the deployment on multiple machines simpler.
Is there really much overhead of wrapping processes in docker containers vs orchestrating them via a process manager? Since containers are basically a set of mounts and namespaces, what memory overhead do containerized processes incur that non-containerized processes do not? I am under the impression that a container does not add very much memory overhead itself; it's the process(es) inside the containers that add memory overhead. Please correct me if I'm wrong.
Didn't really mean Docker was the cause of the memory usage. I mean it might add a little overhead, but afaict the article's memory usage comes from the fact he's using a bunch of heavy python libraries making each process come to about 300mb and running 40 workers.
You could get the same performance within 600mb by using 2 processes each running 20 threads.
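A sketch of what one of those two processes might look like, using only the standard library. The helper names `split` and `crawl_chunk` and the injected `fetch` callable are hypothetical, and launching the two processes themselves (for example with multiprocessing) is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def split(urls, parts):
    """Deal URLs round-robin into `parts` lists, one per worker process."""
    return [urls[i::parts] for i in range(parts)]


def crawl_chunk(urls, fetch, threads=20):
    """Run inside one process: fetch its share of URLs on a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() keeps results in the same order as `urls`.
        return list(pool.map(fetch, urls))
```

With `split(urls, 2)` each process receives half of the work and pays the interpreter and library memory cost once, instead of once per worker.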
There is no avoiding the GIL within a single Python process (even with asyncio IIRC, though I've been using JS lately). Multiprocessing is usually the most efficient way to execute I/O intensive, independent parallel operations. Of course you can also run threads within each process.
I do wonder where the 300mb memory is coming from. Surely it can't all be python interpreter? It doesn't look like he's importing 300mb of modules, unless MongoClient really is that big. In that case he could create a separate worker process for persisting data, and only that worker process needs to load the MongoClient module.
One explanation for the memory overhead might be conntrack tables within the network namespace of the container. However I would expect that conntrack table to be on the host, where SNAT is performed. As an aside, the default Docker networking configuration is really not well suited to concurrent network requests, whether inbound or outbound. If you can avoid NAT (and therefore a conntrack table), that is preferable.
Thanks for your comment. Great article on networking. I'm not so familiar with NAT on Linux, so could you post more details about how it relates to python or docker, such as performance or other advantages?
Sorry for the incomplete description in my article. Because of Python's GIL, a Python application cannot make full use of the machine. Of course, libraries like gevent/asyncio can work around that well. I will run a test to find out which approach is best, including docker.
On the Clojure side of things, I recently used this [1] to scrape/parse ~4m pages in a few hours. It's very plug-and-play, but maintains a pretty decent amount of extensibility. Parsing using Tika turned out to be extremely useful.
While it's on topic.. anyone have any other recommendations for web crawlers? I'm particularly interested in finding unique identifiers (phone numbers, emails) and their contexts on gov-owned websites for a project.
I know the benchmarks cannot really be compared, since one involves a queue, and mongo, while the other does not. But, it seems like a prime use case for async.
Could also use multiprocessing; I got ~500req/s returning a 'hello world' response (which the article also does). The article does about 300req/s but that's because he saturates his pipe. The reality is the article might be faster than 1,000,000/hour.
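from multiprocessing import Pool
from requests import get

urls = 1000 * ['http://localhost/hello']

def scrape(url):
    # Fetch one page and return its body as text.
    return get(url).text

if __name__ == '__main__':
    # 40 worker processes share the URL list.
    with Pool(40) as p:
        results = p.map(scrape, urls)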
As I said, the benchmark is flawed since it's dependent on the network pipe. It would be a good idea to run tests locally so you get a real maximum.
There are lots of factors involved which can completely skew benchmarks, for example, if you were scraping an average 10kb response instead of 'hello world' you would automatically be limited to 100req/s on a 10mbit pipe.
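The bandwidth ceiling above can be checked with a few lines of arithmetic. The function name is made up for this sketch, and it assumes decimal units (1 Mbit = 1,000,000 bits, 1 kB = 1,000 bytes) and no protocol overhead.

```python
def max_requests_per_second(link_mbit_per_s, response_kb):
    """Upper bound on req/s when the link is the only bottleneck."""
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    return bytes_per_second / (response_kb * 1000)


print(max_requests_per_second(10, 10))  # 125.0
```

The raw figure is 125 req/s for a 10kb response on a 10mbit pipe. Headers and TCP overhead reduce that further, which is in line with the roughly 100req/s quoted above.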
Thanks for your comment. I've used multiprocessing/threads/gevent/asyncio before, and I will run a full test with these libraries and tools.
This post is just a quick demo of building a distributed crawler with docker. Asyncio and aiohttp are a great combination for this case, using less memory and running faster. But aiohttp only supports HTTP proxies, which is perhaps its only shortcoming.
Thanks again for your comment. You are welcome to discuss more technical details about it.
Seems like this wouldn't really be useful to scrape js rendered content or any content of "real" value that had any kind of rate limiting or monitoring enabled. Spreading the ip space and making scraping look like genuine user input is a far greater challenge than spinning up a RMQ cluster.
You are right. But with more code or tools, it could do this too. It's just a quick demo of a distributed crawler. If you monitor the traffic of a target website with js rendered content, you will find JSON files and JSON APIs. After that, what you need is just the same code as in my article.
I use Celery inside Docker, mostly for the lazy-ops advantages; makes it very simple to bring up new pools of Celery workers, shut them all down, and mix projects on the same host to maximize its utilization.
Generally I'm getting fond of containers as a mechanism to encapsulate deployments, e.g. in Python, which have a lot of requirements and which I've found finicky to make portable.
Full disclosure: I do something even worse, have containers which pull updates when I like with deploy keys, and run Celery etc in a virtualenv in the container... :P
The latter feels truly shameful but it does make it easy to keep the project contained even when running outside a container...
Also, docker makes it trivial to link a bunch of swarm hosts together; scaling this across multiple machines would basically be free as he added them to the swarm.