| HN Mirror


	by j_s 3402 days ago
	Can you beat the article's 800 concurrent connections & 12GB RAM used to scrape 100000 pages in 15 minutes, with just one process? Not close to a real comparison without the same URLs, but still fun to compare.

3 comments

beejiu 3402 days ago

I've just run a small test (crawling a server running locally) and it comes out at 243 pages per second with one process. This crawls a webpage, adds its links to the queue and saves the URL in a Redis set. This is running on a MacBook Pro.
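
The loop described above (fetch a page, queue its links, record the URL in a set) could look roughly like the sketch below. The commenter's actual code is not shown in the thread, so everything here is an assumption: `crawl`, `LinkParser`, and the `fetch` callable are hypothetical names, and a plain in-memory `set` stands in for the Redis set (with redis-py, `sadd` would fill the same role).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl in a single process.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    `seen` is an in-memory set standing in for the Redis set.
    """
    queue = deque([start_url])
    seen = {start_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        crawled.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # skip URLs already queued or crawled
                seen.add(link)
                queue.append(link)
    return crawled


# Tiny in-memory "site" so the sketch runs without a network.
PAGES = {
    "http://localhost/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://localhost/a": '<a href="/">home</a> <a href="/b">b</a>',
    "http://localhost/b": '<a href="/a">a</a>',
}

print(crawl("http://localhost/", PAGES.get))
```

Passing `fetch` in as a parameter keeps the queue and de-duplication logic separate from the network layer, so the same loop could be driven by a real HTTP client.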

dchuk 3402 days ago

So you eliminated the biggest cause of slowdown in crawling, network latency, and are asserting yours is faster?

beejiu 3402 days ago

The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth rather than network latency - in which case that is really not a limiting factor when running in a datacenter (40 Gbps in at Linode for example).

plantpark 3402 days ago

Some Python libraries, such as asyncio or gevent, can do work asynchronously and efficiently. I will test these libraries later. In the meantime, you are welcome to post more details about asynchronous I/O in Node.js. Thanks again for your comment!

gbrits 3402 days ago

Network latency would be a valid concern for average time per request, but not necessarily for throughput.
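
One way to see the latency/throughput distinction is a small simulation in which every request takes the same fixed time but the number of requests in flight varies. This is only a sketch under stated assumptions: `fake_fetch`, `run`, and `measure` are hypothetical names, and `asyncio.sleep` stands in for a real network round trip.

```python
import asyncio
import time


async def fake_fetch(latency, limiter):
    # Hypothetical stand-in for an HTTP request: just waits `latency` seconds.
    async with limiter:
        await asyncio.sleep(latency)


async def run(n_requests, latency, concurrency):
    # The semaphore caps how many simulated requests are in flight at once.
    limiter = asyncio.Semaphore(concurrency)
    start = time.perf_counter()
    await asyncio.gather(*(fake_fetch(latency, limiter) for _ in range(n_requests)))
    return time.perf_counter() - start


def measure(n_requests=20, latency=0.05, concurrency=1):
    """Return simulated throughput in requests per second."""
    elapsed = asyncio.run(run(n_requests, latency, concurrency))
    return n_requests / elapsed


if __name__ == "__main__":
    for c in (1, 5, 20):
        print(f"concurrency={c:2d}  ~{measure(concurrency=c):.0f} req/s")
```

The per-request time is the same in every run; only the number of overlapping requests changes, and throughput rises with it until some other resource (bandwidth, CPU, the remote server) becomes the limit.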

j_s 3402 days ago

Nice! (compared to the article's 100,000 pages / (15 minutes × 60 seconds per minute) ≈ 111 pages per second)

Did you happen to track memory usage at all? It would take a while to settle down, for sure.

I'm always interested in the amount of overhead Docker brings to the table. No biggie either way, thanks for sharing these details.
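
For reference, the article's throughput figure can be checked with a few lines of arithmetic. The last calculation is an assumption rather than something the thread states: it applies Little's law as if all 800 connections were busy for the whole run.

```python
def pages_per_second(pages, minutes):
    """Average throughput over the whole run."""
    return pages / (minutes * 60)


article = pages_per_second(100_000, 15)
print(f"article: {article:.1f} pages/s")               # 111.1
print(f"local test vs article: {243 / article:.2f}x")  # 2.19

# Assumption: all 800 connections stay busy throughout (Little's law),
# so average time per request = connections / throughput.
print(f"implied time per request: {800 / article:.1f} s")  # 7.2
```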

plantpark 3402 days ago

Thanks for your comment. I will run a test against a local server later and find out what the upper limit of mine is.

wopwopwop 3402 days ago

I'm not an expert, but doesn't celery use multiple processes? I hope I'm not saying anything stupid, but I thought that's what "workers" were.

staticautomatic 3402 days ago

Celery will typically use lightweight "threads" like greenlets or eventlets. I don't think it uses multiple processes, insofar as we're talking about Python, where process == core.

brianwawok 3401 days ago

You can do greenlets or prefork concurrency. With prefork concurrency you get 1 process per fork. Gives you concurrency at the expense of a bit of memory.
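
As a rough illustration of the prefork option, a Celery configuration module might set the pool and its size as below. This is a sketch, not the article's configuration: the concurrency value is made up, and the module is assumed to be loaded by a Celery app in the usual way.

```python
# celeryconfig.py (illustrative values, not taken from the article)

# Prefork pool: each worker slot is a separate OS process.
worker_pool = "prefork"

# Number of child processes; Celery defaults to the number of CPU cores.
worker_concurrency = 4

# Greenlet-based pools (gevent, eventlet) are normally chosen on the
# command line instead, e.g. `celery -A proj worker -P gevent -c 100`,
# where `proj` is a placeholder application name.
```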

plantpark 3402 days ago

Celery has a prefork pool that can make use of multiple processes.

zepolen 3402 days ago

Article doesn't do that all in one process either.

j_s 3402 days ago

Exactly!

I was hoping to compare resource utilization and performance between 40 Docker instances each with 20 connections vs. 1 process.

It's not even clear whether or not the author actually hit any external websites: "In order to have a quick test, I just build a nginx hello page in my cloud server. Then scale it up to a list of 1000000."

plantpark 3402 days ago

Sorry for the incomplete description in my article.

It's just a tutorial about how to use Docker and Celery to build a distributed system. I will run more tests on the performance of multiprocessing, threads, and other concurrency approaches, or on other libraries that support these techniques. In this case, if you don't know how to build a test web server, or don't want to, some big sites like example.com could be your choice. BUT please be gentle with these public sites.

Thanks for your comment.
