Build a crawler to crawl million pages with only one machine in just 2 hours (medium.com)
134 points by plantpark 3402 days ago
9 comments

Not too long ago I built a small web crawler using Node.js, figuring that crawlers spend most of their time waiting (e.g. downloading) and therefore Node.js would be well suited. At the time I found crawlers written in Python were fairly slow, which is not a surprise. Mine is backed by Redis and is pretty fast even on a single process. https://github.com/brendonboshell/supercrawler
JavaScript is the language to build a smaller crawler in, because web pages run JavaScript and your crawler needs to as well. A C++ or Golang crawler will be a little faster and use a lot less memory, but you have to compile in WebKit and do a bunch of hacky stuff to run the page's JavaScript.

On the other hand, if you are building a massive crawler you want to split the crawling and parsing into two separate functions and do the network I/O in golang/c and do the parsing with a Javascript headless browser like phantom.

I don't really see any reason to use python unless you haven't learned golang (the world's biggest crawler's own language).

No one else has mentioned it, but evaluating javascript on random webpages is something that one would need to be deeply careful about.
Care to elaborate?
Can you beat the article's 800 concurrent connections & 12GB RAM used to scrape 100000 pages in 15 minutes, with just one process?

Not close to a real comparison without the same URLs, but still fun to compare.

I've just run a small test (crawling a server running locally) and it comes out at 243 pages per second with one process. This crawls a webpage, adds its links to the queue and saves the URL in a Redis set. This is running on a Macbook Pro.
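For readers who want to see the shape of that loop (crawl a page, queue its links, remember which URLs are known), here is a minimal Python sketch. It is not the supercrawler code: the Redis set and queue are swapped for an in-memory `set` and `deque`, and `fetch` is a hypothetical callable that returns a page's HTML as text.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` is an assumed callable returning HTML text."""
    seen = {start_url}          # in-memory stand-in for the Redis set of known URLs
    queue = deque([start_url])  # in-memory stand-in for the crawl queue
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        crawled.append(url)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

Passing `fetch` in as a parameter keeps the loop independent of the HTTP client, so the same sketch can be driven by a real downloader or by canned pages in a test.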
So you eliminated the biggest cause of slow down in crawling, network latency, and are asserting yours is faster?
The selling point of Node.js is asynchronous I/O. I'm sure you mean bandwidth rather than network latency - in which case that is really not a limiting factor when running in a datacenter (40 Gbps in at Linode for example).
Some Python libraries, such as asyncio or gevent, can do work asynchronously and efficiently. I will test these libraries later. In the meantime, you are welcome to post more details about asynchronous I/O in Node.js. Thanks for your comment again!
Network latency would be a valid concern for avg. time/req. not for throughput (necessarily)
Nice! (compared to the article's 100,000 pages / 15 minutes / 60 seconds = ~110 pages per second)

Did you happen to track memory usage at all? It would take a while to settle down, for sure.

I'm always interested in the amount of overhead Docker brings to the table. No biggie either way, thanks for sharing these details.

Thanks for your comment. I will run a test against a local server later and find out what the upper limit of mine is.
I'm not an expert, but doesn't celery use multiple processes? I hope I'm not saying anything stupid, but I thought that's what "workers" were.
Celery will typically use lightweight "threads" like greenlets or eventlets. I don't think it uses multiple processes insofar as we're talking about Python, where process == core.
You can do greenlets or prefork concurrency. With prefork concurrency you get 1 process per fork. Gives you concurrency at the expense of a bit of memory.
Celery has a prefork pool that can make use of multiple processes.
Article doesn't do that all in one process either.
Exactly!

I was hoping to compare resource utilization and performance between 40 Docker instances each with 20 connections vs. 1 process.

It's not even clear whether or not the author actually hit any external websites: "In order to have a quick test , I just build a nginx hello page in my cloud server. Then scale it up to a list of 1000000."

Sorry for the incomplete description in my article.

It's just a tutorial about how to use Docker and Celery to build a distributed system. I will run more tests on the performance of multiprocessing/threads/concurrency and other libraries that support these techniques. In this case, if you don't know how to, or don't want to, build a test web server, some big sites like example.com could be your choice. BUT please be gentle with these public sites.

Thanks for your comment.

Python shouldn't be any slower than Node for crawling if you use the right tools.
Oh boy. When I lived in San Francisco the Python community had this on their coffee mugs :O
Your bottleneck should probably be managing your requests, not saving the document/assets. I doubt that language speed matters much in scraping. Rate limiting and being smart about how you're fetching data should probably be your concern. ... sure, if you don't mind slamming a server with concurrent requests, your language choice might start to matter if your IP isn't blocked first.
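To make the rate-limiting point concrete, here is a minimal sketch of a per-host politeness delay in Python. The class name `HostRateLimiter` and the one-second default are assumptions for illustration; nothing like this appears in the article. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit


class HostRateLimiter:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds waited."""
        host = urlsplit(url).netloc
        now = self.clock()
        ready_at = self.next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self.sleep(delay)
        # Reserve the next slot for this host.
        self.next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay
```

A crawler would call `limiter.wait(url)` right before each request, so different hosts proceed independently while any single host sees at most one request per `min_delay` seconds.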
Check out Scrapy, written in Python. It also uses asynchronous I/O. Also, asyncio in Python 3 should make writing an async crawler a lot easier nowadays.
Thanks for your comment. This article is a tutorial that shows people how to build a distributed system with Docker and Celery easily.

Scrapy is a good framework for crawlers. I have used it before, but it doesn't have some features I want. Writing a brand new one is easier for me.

Of course, I am open to comparing the performance of these two crawlers. You are welcome to discuss more technical details here.

Thanks for your comment again.

Because crawlers spend most of their time waiting, you might as well use Python's gevent.
Python 3's asyncio and aiohttp do the job for me -- I can crawl several times as fast as the article's 111 qps with just a single process. https://github.com/cocrawler/cocrawler

Of course, it matters what you're doing with the page content, and how you're managing your metadata, and all that.
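As a rough illustration of the single-process asyncio approach (not the cocrawler implementation), the sketch below caps the number of in-flight requests with a semaphore. `fetch` is a hypothetical coroutine function; in a real crawler it would wrap an HTTP client such as aiohttp, which is left out here so the example runs with the standard library only.

```python
import asyncio


async def crawl_all(urls, fetch, max_concurrency=100):
    """Fetch all URLs in one process, at most `max_concurrency` at a time.

    `fetch` is an assumed coroutine function: `await fetch(url)` returns the body.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # The semaphore keeps the number of in-flight requests bounded.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(pairs)
```

It would be driven with `asyncio.run(crawl_all(urls, fetch))`; the concurrency cap is the knob that trades throughput against load on the target servers.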

Thanks for your comment.

You are right, asyncio and aiohttp are a great combination. I've used them both before. Though aiohttp uses less memory than requests, aiohttp doesn't support HTTPS proxies. That is the only thing that isn't perfect.

Besides, asyncio is similar to Celery's concurrency. I will test which one performs better.

Thanks for your comment again.

Did I read that right? "it's necessary to deploy docker clusters to maximize performance of your machine" to get the performance out of a single system?
I had the same initial thought too - but he said he's creating a crawler that is meant to be distributed - in which case it's fine to use Docker since it makes the deployment on multiple machines simpler.

However that ram usage though...ugh

Is there really much overhead of wrapping processes in docker containers vs orchestrating them via a process manager? Since containers are basically a set of mounts and namespaces, what memory overhead do containerized processes incur that non-containerized processes do not? I am under the impression that a container does not add very much memory overhead itself; it's the process(es) inside the containers that add memory overhead. Please correct me if I'm wrong.
Didn't really mean Docker was the cause of the memory usage. I mean it might add a little overhead, but afaict the article's memory usage comes from the fact he's using a bunch of heavy python libraries making each process come to about 300mb and running 40 workers.

You could get the same performance within 600mb by using 2 processes each running 20 threads.

But I guess hardware is cheap.
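A sketch of the "2 processes, 20 threads each" layout in Python, using only `concurrent.futures`. The `fetch` function is a placeholder (it does no network I/O), and the memory figure above is the commenter's estimate, not something this sketch measures.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url):
    """Placeholder for a real HTTP request; returns a value derived from the URL."""
    return len(url)


def run_threads(urls, threads=20):
    """Inside one process, fetch a batch of URLs on a pool of threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, urls))


def split(items, parts):
    """Deal items round-robin into `parts` batches."""
    return [items[i::parts] for i in range(parts)]


def crawl(urls, processes=2, threads=20):
    """Run `processes` worker processes, each with its own thread pool."""
    with ProcessPoolExecutor(max_workers=processes) as pool:
        batches = pool.map(run_threads, split(urls, processes), [threads] * processes)
        return [result for batch in batches for result in batch]
```

`crawl(urls)` should be called from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.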

2 processes = 2 GIL

There is no avoiding the GIL within a single Python process (even with asyncio IIRC, though I've been using JS lately). Multiprocessing is usually the most efficient way to execute I/O intensive, independent parallel operations. Of course you can also run threads within each process.

I do wonder where the 300mb memory is coming from. Surely it can't all be python interpreter? It doesn't look like he's importing 300mb of modules, unless MongoClient really is that big. In that case he could create a separate worker process for persisting data, and only that worker process needs to load the MongoClient module.

One explanation for the memory overhead might be conntrack tables within the network namespace of the container. However I would expect that conntrack table to be on the host, where SNAT is performed. As an aside, the default Docker networking configuration is really not well suited to concurrent network requests, whether inbound or outbound. If you can avoid NAT (and therefore a conntrack table), that is preferable.

This stack could also benefit from tuning some kernel parameters, both within the containers and on the host. Great blog post with details: https://blog.packagecloud.io/eng/2017/02/06/monitoring-tunin...

Thanks for your comment. Great article for networking. I'm not so familiar with NAT of linux, so could you post more details about it with python or docker, performance/advantage or something else?

Thanks again.

> 2 processes = 2 GIL

What's your point? 20 threads will still run per GIL, and assuming a dual-core CPU, 2 processes x 20 threads each will still run 40 workers.

Sorry for the incomplete description in my article. Because of Python's GIL, a Python application cannot make full use of a machine. Of course, some libraries like gevent/asyncio can work around this well. I will run a test to find out which one is best, including Docker.

Thanks for your comment.

On the Clojure side of things, I recently used this [0] to scrape/parse ~4m pages in a few hours. It's very plug-and-play, but maintains a pretty decent amount of extensibility. Parsing using Tika turned out to be extremely useful.

While it's on topic.. anyone have any other recommendations for web crawlers? I'm particularly interested in finding unique identifiers (phone numbers, emails) and their contexts on gov-owned websites for a project.

[0] https://github.com/junjiemars/itsy

Great crawler, thanks for sharing.
Agreed. It's probably saved me over 100 hours of work in the past two months.
Wouldn't it be more appropriate to use something like aiohttp? https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22... With no docker, or anything like that.

I know the benchmarks cannot really be compared, since one involves a queue, and mongo, while the other does not. But, it seems like a prime use case for async.

Could also use multiprocessing, got about ~500req/s returning a 'hello world' response (which the article also does). The article does about 300req/s but that's because he saturates his pipe. The reality is the article might be faster than 1,000,000/hour.

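    from multiprocessing import Pool
    from requests import get

    urls = 1000 * ['http://localhost/hello']

    def scrape(url):
        # Fetch one page and return its body as text.
        return get(url).text

    if __name__ == '__main__':
        # 40 worker processes; the guard is needed on platforms that spawn workers.
        with Pool(40) as p:
            results = p.map(scrape, urls)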
~2.2 seconds on a dual core 2.2ghz
Thanks for your comment. If you run the same test against a cloud server or a public website, the rate will perhaps decrease somewhat.

I've used multiprocessing/threads/gevent/asyncio before, and I will run a full test with these libraries.

Thanks again!

As I said, the benchmark is flawed since it's dependent on the network pipe. It would be a good idea to run tests locally so you get a real maximum.

There are lots of factors involved which can completely skew benchmarks, for example, if you were scraping an average 10kb response instead of 'hello world' you would automatically be limited to 100req/s on a 10mbit pipe.
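The bandwidth ceiling in that example can be worked out directly. The helper below is a back-of-the-envelope sketch that assumes decimal units (1 Mbit = 1,000,000 bits, 1 kB = 1,000 bytes) and ignores protocol overhead; the function name is made up for illustration.

```python
def max_requests_per_second(link_mbit_per_s, response_kilobytes):
    """Upper bound on requests/second when the link is the only bottleneck."""
    link_bits_per_s = link_mbit_per_s * 1_000_000
    response_bits = response_kilobytes * 1_000 * 8
    return link_bits_per_s / response_bits


# 10 Mbit/s link, 10 kB responses: 10,000,000 / 80,000 = 125 req/s
print(max_requests_per_second(10, 10))
```

That gives a theoretical ceiling of 125 req/s, so once headers and TCP overhead are counted, a figure of roughly 100 req/s is in the right range.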

Thanks for your comment. I've used multiprocessing/threads/gevent/asyncio before, and I will run a full test with these libraries or tools. This post is just a quick demo of building a distributed crawler with Docker. asyncio and aiohttp are a great combination for this case, using less memory and running faster. But aiohttp only supports HTTP proxies; perhaps that is the only thing that isn't perfect.

Thanks for your comment again. You are welcome to discuss more technical details about it.

You might want to set a specific user-agent for your crawler
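For example, with the standard library's `urllib.request` a crawler-specific User-Agent can be attached to every request. The agent string and the info URL below are made-up placeholders, not values from the article.

```python
from urllib.request import Request

# Hypothetical identifier; a real crawler would point to a page describing the bot.
USER_AGENT = "example-crawler/0.1 (+https://example.com/bot-info)"


def build_request(url):
    """Return a Request that identifies the crawler via its User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

The resulting object can be passed to `urllib.request.urlopen`; most third-party HTTP clients accept an equivalent headers mapping.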
Seems like this wouldn't really be useful to scrape js rendered content or any content of "real" value that had any kind of rate limiting or monitoring enabled. Spreading the ip space and making scraping look like genuine user input is a far greater challenge than spinning up a RMQ cluster.
You are right. But with more code or tools, it could do this too. It's just a quick demo of a distributed crawler. If you monitor the traffic of a target website with JS-rendered content, you will find JSON files and JSON APIs. After that, what you need is just the same code as in my article.
Another worthwhile article if you are building a crawler. http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil...
I've seen this article. Great article about distributed crawlers. Mine is just a demo version of his.
Crawling described here is very inefficient. For efficient and high performance crawling I recommend libcurl and curlmulti.
Question to the experts here:

- What is the relevance of Docker here? I'm pretty sure that celery+rabbitmq are enough to do a distributed scraper...

I think the OP just drank the docker kool-aid :) It's also the future, obvs. https://circleci.com/blog/its-the-future/

> and learn how to use docker and celery

Seems the OP was learning Docker at the time? I think it just comes down to the tools you're comfortable with.

I use Celery inside Docker, mostly for the lazy-ops advantages; makes it very simple to bring up new pools of Celery workers, shut them all down, and mix projects on the same host to maximize its utilization.

Generally I'm getting fond of containers as a mechanism to encapsulate deployments, e.g. in Python, which have a lot of requirements and which I've found finicky to make portable.

Full disclosure: I do something even worse, have containers which pull updates when I like with deploy keys, and run Celery etc in a virtualenv in the container... :P

The latter feels truly shameful but it does make it easy to keep the project contained even when running outside a container...

Not relevant at all.

It was just shoehorned.

The crux of a project such as this is maintaining a connection pool and managing it efficiently.

Also respecting robots.txt which the author barely mentions.

This is a "tool looking for a problem" kind of post.
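On the robots.txt point, Python's standard library already ships a parser, so respecting it takes little code. This is a minimal sketch: the helper name `make_robots_checker` and the agent name are assumptions, and the rules are passed in as text so the example does not touch the network.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent):
    """Return a function telling whether `user_agent` may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
allowed = make_robots_checker(rules, "example-crawler")
print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/data"))
```

In a live crawler the rules would be downloaded once per host (for instance with `RobotFileParser.set_url` and `read`) and cached, and every URL would be checked before it is queued.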

Also, Docker makes it trivial to link a bunch of swarm hosts together; scaling this across multiple machines would basically be free as he added them to the swarm.