Hacker News new | ask | show | jobs
by j_s 3402 days ago
Can you beat the article's 800 concurrent connections & 12GB RAM used to scrape 100000 pages in 15 minutes, with just one process?

Not close to a real comparison without the same URLs, but still fun to compare.

3 comments

I've just run a small test (crawling a server running locally) and it comes out at 243 pages per second with one process. This crawls a webpage, adds its links to the queue and saves the URL in a Redis set. This is running on a Macbook Pro.
So you eliminated the biggest cause of slow down in crawling, network latency, and are asserting yours is faster?
The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth rather than network latency - in which case that is really not a limiting factor when running in a datacenter (40 Gbps in at Linode for example).
Some library in python ,such as asyncio or gevent could do some work asynchronously and efficently. I will have a test later for these library. In the meanwhile , welcome to post more details about asynchronous of Node.js. Thanks for your comment again!
Network latency would be a valid concern for avg. time/req. not for throughput (necessarily)
Nice! (compared to the article's 100,000 pages / 15 minutes / 60 seconds = ~110 pages per second)

Did you happen to track memory usage at all? It would take a while to settle down, for sure.

I'm always interested in the amount of overhead Docker brings to the table. No biggie either way, thanks for sharing these details.

Thanks for your comment, I will have a test with a local server later and find out what's the upper limit of mine
I'm not an expert, but doesn't celery use multiple processes? I hope I'm not saying anything stupid, but I thought that's what "workers" were.
Celery will typically use lightweight "threads" like greenlets or eventlets. I don't think it use multiple processes insofar as we're talking about python where process == core.
You can do greenlets or prefork concurrency. With prefork concurrency you get 1 process per fork. Gives you concurrency at the expense of a bit of memory.
celery has a prefork pool that could take use of multiple processes.
Article doesn't do that all in one process either.
Exactly!

I was hoping to compare resource utilization and performance between 40 Docker instances each with 20 connections vs. 1 process.

It's not even clear whether or not the author actually hit any external websites: In order to have a quick test , I just build a nginx hello page in my cloud server. Then scale it up to a list of 1000000.

Sorry for incomplete description in my article.

It's just a tutorial about how to use docker and celery to build a distributed system. I will have more test about the performance of multiprocess/threads/concurrencies or some other library that supports these technics. In this case, if you don't know how to or even don't want to build a test web server, some big sites like example.com could be your choice. BUT please be gentle to these public sites.

Thanks for your comment.