by beejiu 3402 days ago
Not too long ago I built a small webcrawler using Node.js, figuring that crawlers spend most of their time waiting (e.g. downloading) and therefore Node.js would be well suited. At the time I found crawlers written in Python were fairly slow, which is not a surprise. It is backed by Redis and is pretty fast even on a single process. https://github.com/brendonboshell/supercrawler
5 comments

JavaScript is the language to build a small crawler in, because web pages run JavaScript and your crawler needs to as well. A C++ or Golang crawler will be a little faster and use a lot less memory - but you have to compile in WebKit and do a bunch of hacky stuff to run the page's JavaScript.

On the other hand, if you are building a massive crawler you want to split the crawling and parsing into two separate functions and do the network I/O in golang/c and do the parsing with a Javascript headless browser like phantom.

I don't really see any reason to use python unless you haven't learned golang (the world's biggest crawler's own language).

No one else has mentioned it, but evaluating javascript on random webpages is something that one would need to be deeply careful about.
Care to elaborate?
Can you beat the article's 800 concurrent connections & 12GB RAM used to scrape 100000 pages in 15 minutes, with just one process?

Not close to a real comparison without the same URLs, but still fun to compare.

I've just run a small test (crawling a server running locally) and it comes out at 243 pages per second with one process. This crawls a webpage, adds its links to the queue and saves the URL in a Redis set. This is running on a Macbook Pro.
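A minimal sketch of the loop described above (fetch a page, queue its links, record each URL once), written in Python for illustration. It is not the supercrawler implementation: `FAKE_SITE` and `fetch` are hypothetical stand-ins for real HTTP requests, and a plain in-memory `set` plays the role of the Redis set.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "http://localhost/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://localhost/a": '<a href="/b">b</a> <a href="/">home</a>',
    "http://localhost/b": '<a href="/c">c</a>',
    "http://localhost/c": "no links here",
}


def fetch(url):
    """Stand-in for an HTTP GET; returns HTML, or None for unknown URLs."""
    return FAKE_SITE.get(url)


def crawl(start_url):
    seen = {start_url}          # plays the role of the Redis set
    queue = deque([start_url])  # URLs waiting to be fetched
    order = []                  # URLs in the order they were crawled
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # enqueue each URL only once
                seen.add(link)
                queue.append(link)
    return order


crawled = crawl("http://localhost/")
print(crawled)
```

Marking a URL as seen when it is enqueued, rather than when it is fetched, keeps duplicates out of the queue even when many pages link to the same target.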
So you eliminated the biggest cause of slow down in crawling, network latency, and are asserting yours is faster?
The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth rather than network latency - in which case that is really not a limiting factor when running in a datacenter (40 Gbps in at Linode for example).
Some Python libraries, such as asyncio or gevent, can do this work asynchronously and efficiently. I will test these libraries later. In the meantime, you're welcome to post more details about how Node.js handles asynchronous I/O. Thanks again for your comment!
Network latency would be a valid concern for avg. time/req. not for throughput (necessarily)
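The latency versus throughput distinction can be made concrete with Little's law (concurrency = throughput × average time per request). The sketch below plugs in the figures quoted earlier in this thread for the article's run. It assumes all 800 connections stayed busy for the whole 15 minutes, which the thread does not confirm, so the result is a rough bound rather than a measurement.

```python
# Figures quoted earlier in this thread for the article's run.
pages = 100_000
minutes = 15
concurrent_connections = 800

throughput = pages / (minutes * 60)  # pages per second

# Little's law: concurrency = throughput * average time per request.
# Assumption: every connection was busy for the entire run.
avg_seconds_per_request = concurrent_connections / throughput

print(f"throughput: {throughput:.1f} pages/s")
print(f"implied average time per request: {avg_seconds_per_request:.1f} s")
```

Under that assumption, higher latency mainly raises the number of requests that must be in flight to sustain a given throughput. It does not cap the throughput itself.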
Nice! (compared to the article's 100,000 pages / 15 minutes / 60 seconds = ~111 pages per second)

Did you happen to track memory usage at all? It would take a while to settle down, for sure.

I'm always interested in the amount of overhead Docker brings to the table. No biggie either way, thanks for sharing these details.

Thanks for your comment. I will run a test against a local server later and find out what my crawler's upper limit is.
I'm not an expert, but doesn't celery use multiple processes? I hope I'm not saying anything stupid, but I thought that's what "workers" were.
Celery will typically use lightweight "threads" like greenlets or eventlets. I don't think it uses multiple processes insofar as we're talking about Python where process == core.
You can do greenlets or prefork concurrency. With prefork concurrency you get 1 process per fork. Gives you concurrency at the expense of a bit of memory.
Celery has a prefork pool that can make use of multiple processes.
Article doesn't do that all in one process either.
Exactly!

I was hoping to compare resource utilization and performance between 40 Docker instances each with 20 connections vs. 1 process.

It's not even clear whether or not the author actually hit any external websites: "In order to have a quick test, I just build a nginx hello page in my cloud server. Then scale it up to a list of 1000000."

Sorry for the incomplete description in my article.

It's just a tutorial about how to use Docker and Celery to build a distributed system. I will run more tests on the performance of multiple processes, threads, and other concurrency approaches, or of other libraries that support these techniques. In this case, if you don't know how to build a test web server, or don't want to, some big sites like example.com could be your choice. BUT please be gentle with these public sites.

Thanks for your comment.

Python shouldn't be any slower than Node for crawling if you use the right tools.
Oh boy. When I lived in San Francisco the Python community had this on their coffee mugs :O
Your bottleneck should probably be managing your requests, not saving the document/assets. I'm skeptical that language speed is non-negligible in scraping. Rate limiting and being smart about how you're fetching data should probably be your concern. ... sure, if you don't mind slamming a server with concurrent requests, your language choice might start to matter if your IP isn't blocked first.
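One way to picture the rate limiting mentioned above is a per-host limiter that spaces out requests to the same server. This is only an illustrative sketch: `HostRateLimiter`, `reserve`, and the one-second interval are hypothetical names and values, not taken from any crawler discussed in this thread. The clock is passed in explicitly so the behaviour is easy to follow.

```python
class HostRateLimiter:
    """Spaces out requests to the same host by a minimum interval."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between hits per host
        self.next_allowed = {}            # host -> earliest next request time

    def reserve(self, host, now):
        """Reserve the next slot for host; return how long to wait from now."""
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start - now


limiter = HostRateLimiter(min_interval=1.0)

# Three requests to one host arriving at the same instant are spread out,
# while a request to a different host is not delayed at all.
delays = [
    limiter.reserve("example.com", now=0.0),
    limiter.reserve("example.com", now=0.0),
    limiter.reserve("example.com", now=0.0),
    limiter.reserve("example.org", now=0.0),
]
print(delays)
```

In a real crawler the caller would pass something like `time.monotonic()` as `now` and sleep for the returned delay before sending the request.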
Check out Scrapy, written in Python. It also uses asynchronous I/O. Also, asyncio in Python 3 should make writing an async crawler a lot easier nowadays.
Thanks for your comment. This article is a tutorial that shows people how to build a distributed system with Docker and Celery easily.

Scrapy is a good framework for crawlers. I have used it before, but it doesn't have some features I want. Writing a brand new one is easier for me.

Of course, I am open to comparing the performance of these two crawlers. You're welcome to discuss more technical details here.

Thanks again for your comment.

Because crawlers spend most of their time waiting, you might as well use Python's gevent.
Python 3's asyncio and aiohttp do the job for me -- I can crawl several times as fast as the article's 111 qps with just a single process. https://github.com/cocrawler/cocrawler

Of course, it matters what you're doing with the page content, and how you're managing your metadata, and all that.
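A rough sketch of the single-process asyncio approach described above, using only the standard library. It is not cocrawler's code: `fake_fetch` is a hypothetical stand-in that just sleeps instead of calling aiohttp, and the latency and concurrency values are made up for illustration.

```python
import asyncio
import time


async def fake_fetch(url, latency=0.05):
    """Hypothetical stand-in for an HTTP request: waits, then returns."""
    await asyncio.sleep(latency)
    return url, 200


async def crawl_all(urls, max_concurrency):
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather runs all fetches concurrently and keeps results in input order.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"http://localhost/page/{i}" for i in range(100)]
started = time.perf_counter()
results = asyncio.run(crawl_all(urls, max_concurrency=50))
elapsed = time.perf_counter() - started
print(f"fetched {len(results)} pages in {elapsed:.2f} s")
```

Because the waits overlap, 100 simulated requests finish in roughly two batches of latency rather than 100 sequential waits. With real pages, parsing and metadata handling would add CPU time that this sketch ignores.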

Thanks for your comment.

You are right, asyncio and aiohttp are a great combination. I've used them both before. Although aiohttp uses less memory than requests, it doesn't support HTTPS proxies. That is the only thing that isn't perfect.

Besides, asyncio works much like Celery's concurrency. I will test which one performs better.

Thanks again for your comment.