Did I read that right? "it's necessary to deploy docker clusters to maximize performance of your machine" to get the performance out of a single system?
I had the same thought at first, but he said he's building a crawler that is meant to be distributed. In that case it's fine to use Docker, since it makes deployment across multiple machines simpler.
Is there really much overhead in wrapping processes in Docker containers vs orchestrating them via a process manager? Since containers are basically a set of mounts and namespaces, what memory overhead do containerized processes incur that non-containerized processes do not? I am under the impression that a container does not add very much memory overhead itself; it's the process(es) inside the container that use the memory. Please correct me if I'm wrong.
I didn't really mean that Docker was the cause of the memory usage. It might add a little overhead, but AFAICT the memory usage in the article comes from the fact that he's running 40 workers, and each one loads a bunch of heavy Python libraries that bring the process to about 300 MB (so roughly 12 GB in total).
You could probably get the same performance in about 600 MB (2 x 300 MB) by using 2 processes, each running 20 threads.
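For what it's worth, here is a minimal sketch of the "2 processes x 20 threads" layout using only the standard library. It is not the article's code: `fetch`, `split_round_robin`, `run_threads` and `crawl` are hypothetical names, and `fetch` just sleeps to stand in for the real download and parse step.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for the real download + parse step.
    time.sleep(0.01)  # simulates waiting on the network
    return url, 200


def split_round_robin(items, n):
    # Deal the items out into n roughly equal chunks, one per process.
    return [items[i::n] for i in range(n)]


def run_threads(urls, n_threads=20):
    # Runs inside one process: a thread pool handles that process's share.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fetch, urls))


def crawl(urls, n_procs=2, n_threads=20):
    # Only n_procs interpreters pay the cost of importing the heavy libraries.
    chunks = split_round_robin(urls, n_procs)
    with ProcessPoolExecutor(max_workers=n_procs) as procs:
        per_process = procs.map(run_threads, chunks, [n_threads] * n_procs)
        return [result for chunk in per_process for result in chunk]
```

You would call `crawl(urls)` from under an `if __name__ == "__main__":` guard so that worker processes can re-import the module safely. The memory saving here is an assumption that follows from the numbers above: the libraries get loaded twice instead of 40 times.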
There is no avoiding the GIL within a single Python process (even with asyncio IIRC, since it runs on a single thread, though I've been using JS lately). Multiprocessing is the usual way to spread independent work across cores. For the I/O-bound part the GIL matters less, because it is released while a thread waits on the network, so of course you can also run threads within each process.
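To illustrate why running threads within a process still helps for I/O, here is a small timing sketch. It assumes the work is dominated by waiting, and uses `time.sleep` as a stand-in for a blocking network call (both release the GIL while they wait). The function names are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_io(delay):
    # time.sleep releases the GIL while waiting, like a blocking socket read.
    time.sleep(delay)
    return delay


def timed(fn):
    # Return the wall-clock seconds taken by fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def sequential(n=8, delay=0.05):
    for _ in range(n):
        blocking_io(delay)


def threaded(n=8, delay=0.05):
    # All n waits overlap, so the total is close to a single delay.
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(blocking_io, [delay] * n))


seq_time = timed(sequential)
thr_time = timed(threaded)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The same experiment with a CPU-bound function in place of `blocking_io` would not show this speedup on a standard CPython build, which is where multiple processes come in.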
I do wonder where the 300 MB of memory is coming from. Surely it can't all be the Python interpreter? It doesn't look like he's importing 300 MB of modules, unless MongoClient really is that big. In that case he could create a separate worker process for persisting data, so that only that worker process needs to load the MongoClient module.
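One rough way to check how much an import costs is `tracemalloc` from the standard library. This is only a sketch: `import_cost_bytes` is a name I made up, and `tracemalloc` only sees allocations made through Python's allocator, so memory that a C extension allocates on its own will not show up. A module that is already imported will report close to zero.

```python
import importlib
import tracemalloc


def import_cost_bytes(module_name):
    """Return the Python-level bytes still allocated after importing module_name."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    importlib.import_module(module_name)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Standard-library modules used as examples; to test the theory above you
# would pass "pymongo" instead (assuming it is installed).
for name in ("json", "decimal"):
    print(name, import_cost_bytes(name))
```

Comparing that number against the resident set size of a worker would show whether the imports explain the 300 MB or whether it comes from somewhere else.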
One explanation for the memory overhead might be conntrack tables within the network namespace of the container. However I would expect that conntrack table to be on the host, where SNAT is performed. As an aside, the default Docker networking configuration is really not well suited to concurrent network requests, whether inbound or outbound. If you can avoid NAT (and therefore a conntrack table), that is preferable.
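If you want to check whether conntrack is a factor, the kernel exposes the current and maximum number of tracked connections under `/proc/sys/net/netfilter` when the `nf_conntrack` module is loaded. A minimal sketch for reading them, run on the host (the function name and the `base` parameter are just for illustration):

```python
from pathlib import Path


def conntrack_usage(base="/proc/sys/net/netfilter"):
    """Return (current_entries, max_entries), or None if conntrack is not available."""
    base = Path(base)
    try:
        count = int((base / "nf_conntrack_count").read_text())
        limit = int((base / "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        # Files are missing (module not loaded, non-Linux host) or unreadable.
        return None
    return count, limit


print(conntrack_usage())
```

If the count gets close to the maximum while the crawler is running, that points at the NAT path. One way to avoid NAT is to run the container with `docker run --network host`, at the cost of losing network isolation.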
Thanks for your comment. Great article on networking. I'm not very familiar with NAT on Linux, so could you post more details about how it relates to Python or Docker, e.g. the performance impact, the advantages, or anything else?
Sorry for the incomplete description in my article. Because of Python's GIL, a single Python process cannot make full use of the machine. Of course, libraries like gevent or asyncio can help a lot with that. I will run a test to find out which approach works best, including Docker.
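For comparison in that test, an asyncio version could look roughly like the sketch below. It keeps 40 requests in flight inside one process. `fetch_async` and `crawl_async` are hypothetical names, and `asyncio.sleep` stands in for a real HTTP request, which would need an async HTTP client.

```python
import asyncio


async def fetch_async(url, limit):
    # The semaphore caps how many requests are in flight at once.
    async with limit:
        await asyncio.sleep(0.01)  # hypothetical stand-in for a real HTTP request
        return url, 200


async def crawl_async(urls, max_in_flight=40):
    limit = asyncio.Semaphore(max_in_flight)
    # gather returns the results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_async(url, limit) for url in urls))


urls = [f"http://example.com/{i}" for i in range(100)]
results = asyncio.run(crawl_async(urls))
print(len(results))
```

This handles the waiting-on-the-network part in a single interpreter, so the libraries are loaded once. Any CPU-heavy parsing would still run on one core.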
That RAM usage though... ugh