by alexbardas 5055 days ago
Not bad at all. A few months ago I built a crawler in Node.js (not publicly released yet, though I plan to) to take advantage of its evented architecture. I managed to crawl and store (in mongo) more than 300k movies from IMDB in just a few hours, using only a laptop and 8 processes, each with a fixed number of concurrent connections (it was based on the nodejs cluster module and the kue lib by learnboost). For HTML parsing I used jsdom or cheerio (faster but incomplete), and extracting and storing the data was very fast (probably less than 10 ms per page). Kue is similar to ruby's resque or python's pyres, so the advantage was that every request was basically an independent job, using redis as a pubsub.

Even though your implementation is a lot more complex and very well documented, IMO non-blocking I/O is a much better solution, because crawling is very I/O-intensive and most of the time is spent on the connection (request + response time). With that many machines and processes, the time should be much shorter with node.
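To illustrate why bounded concurrent connections help with I/O-bound crawling, here is a minimal sketch. It is not the Node.js crawler described above: it uses Python's asyncio as a stand-in for an evented runtime, and `fetch`, `crawl`, `timed_crawl`, and the example.com URLs are all hypothetical names. The network call is simulated with a 0.1 s sleep so the sketch runs offline.

```python
import asyncio
import time


async def fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request; the sleep models latency."""
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


async def crawl(urls, max_connections: int = 10):
    """Fetch every URL with at most max_connections requests in flight."""
    limit = asyncio.Semaphore(max_connections)

    async def worker(url):
        # The semaphore caps concurrency; waiting on I/O never blocks the loop.
        async with limit:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


def timed_crawl(n_pages: int, max_connections: int):
    """Crawl n_pages placeholder URLs and report the elapsed wall time."""
    urls = [f"http://example.com/title/{i}" for i in range(n_pages)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls, max_connections))
    return pages, time.perf_counter() - start
```

With 20 pages and 10 connections, the simulated crawl takes about two rounds of latency instead of twenty, which is the effect being argued for: the time is dominated by waiting on connections, so overlapping them is what matters.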


> I managed to crawl [..] more than 300k movies from IMDB in just a few hours

I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling from a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server can tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping the hosting at low cost.

Very true indeed. I was also randomly changing user-agents (Mozilla, Safari, Chrome, IE). I thought this would make it harder to tell whether there is just a lot of traffic from the same network or someone is intensively crawling the site.

For me, it was more a proof of how efficient and fast a crawler can be. Also, responses from IMDB were very fast (less than 0.4 seconds), so not that much time was lost there.

Gray hat question out of curiosity and possible experience: did you also use proxies or perhaps even Tor?
So how polite does one need to be? One hit per x seconds?
If the /robots.txt does not mention a Crawl-delay, one page per 3 seconds is often a safe value. Of course, this depends rather heavily on the site. In any case, if you have any specific need, always contact the people responsible for the site. I occasionally run custom queries against the database on request, for example.
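The rule of thumb above can be sketched in a few lines: honor Crawl-delay when robots.txt sets one, otherwise fall back to 3 seconds. This uses Python's standard `urllib.robotparser`; the helper name `polite_delay` and the constant are made up for illustration, and the 3-second default is only the value suggested in this thread, not a standard.

```python
from urllib.robotparser import RobotFileParser

# Fallback from the comment above; an assumption, not part of any spec.
DEFAULT_DELAY_SECONDS = 3.0


def polite_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Return the seconds to wait between requests for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when no Crawl-delay applies to the agent.
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS
```

A crawler would then sleep for `polite_delay(...)` seconds between requests to the same host. Note that the standard-library parser only accepts whole-number Crawl-delay values.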
> I managed to crawl and store (in mongo) more than 300k movies from IMDB in just a few hours

Did you know that IMDB makes a subset of their data publicly available? http://www.imdb.com/interfaces/

Yes, but it's hard to tell how complete and up to date that subset is. There is (or was, until a few weeks ago apparently) also http://www.imdbapi.com/, which retrieved their data by crawling. Unfortunately, it was shut down.
Please, do release it! I'm (or was; I decided to go with Apache Nutch for the time being) in the process of creating a similar crawler (with almost exactly the same "technologies" you mentioned). It would save me a lot of time, and we might be able to help with fixing bugs and adding features...
OK, then I'll work on writing documentation and adding some tests. The project is written in CoffeeScript, and you only need to extend a class, implementing 2 methods and a list of starting URLs. Using node cluster and concurrent connections, I think it can scale very well. I introduced promises (taken from jQuery Deferred) in case someone wants the writes to the DB to be synchronous.
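For readers wondering what "extend a class and implement 2 methods and a list of starting URLs" could look like, here is a rough sketch in Python rather than CoffeeScript. The comment does not say which two methods are required, so `extract` and `store` are guesses, and `Crawler`, `MovieCrawler`, `run`, and the example URL are all hypothetical names, not the project's actual API.

```python
import re
from abc import ABC, abstractmethod


class Crawler(ABC):
    """Hypothetical base class: subclasses supply URLs and two methods."""

    start_urls: list = []

    @abstractmethod
    def extract(self, url: str, html: str) -> dict:
        """Turn a fetched page into a record."""

    @abstractmethod
    def store(self, record: dict) -> None:
        """Persist one record (a database write in a real crawler)."""

    def run(self, fetch) -> None:
        # fetch is any callable mapping a URL to its HTML, injected so the
        # sketch can run without network access.
        for url in self.start_urls:
            self.store(self.extract(url, fetch(url)))


class MovieCrawler(Crawler):
    start_urls = ["http://example.com/title/1"]  # placeholder URL

    def __init__(self):
        self.saved = []  # in-memory stand-in for a database

    def extract(self, url, html):
        match = re.search(r"<title>(.*?)</title>", html)
        return {"url": url, "title": match.group(1) if match else None}

    def store(self, record):
        self.saved.append(record)
```

Calling `MovieCrawler().run(fetch)` with a real or fake `fetch` fills `saved` with one record per starting URL; the base class owns the loop, so a subclass only describes what to extract and where to put it.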

IMO, using kue was a success because it also offers a web interface where you can check the progress and restart/check failed jobs.

Great - I'll be looking forward to it. What's your GitHub username (if you intend to publish it there) so I can follow you to be notified when it's released?