by alexbardas
5055 days ago
Not bad at all. A few months ago I built a crawler in NodeJS (not publicly released yet, though I plan to) to take advantage of its evented architecture. I managed to crawl and store (in mongo) more than 300k movies from IMDB in just a few hours, using only a laptop and 8 processes, each with a fixed number of concurrent connections (it was based on the nodejs cluster module and the kue lib by learnboost). For HTML parsing I used jsdom or cheerio (faster but incomplete), and extracting and storing the data was very fast (probably less than 10 ms per page). Kue is similar to ruby's resque or python's pyres, so the advantage was that every request was basically an independent job, using redis as a pubsub. Even though your implementation is a lot more complex and very well documented, IMO non-blocking I/O is a much better solution, because crawling is very I/O intensive and most of the time is spent on the connection (request + response time). With that many machines and processes, the time should be much shorter with node.
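To make the idea concrete, here is a rough sketch of the "fixed number of concurrent connections per process" pattern. It is not the crawler described above (that one was NodeJS with cluster and kue); it uses Python's asyncio as a stand-in for an evented runtime, and `fake_fetch`, `crawl`, `run_demo`, and the example.com URLs are all made up for illustration. The fetch only sleeps, so nothing touches the network.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request: sleeps instead of using the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, concurrency=4, fetch=fake_fetch):
    """Fetch every URL with at most `concurrency` requests in flight at once."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        # Each worker pulls the next job as soon as its previous request finishes,
        # so the event loop stays busy while individual requests wait on I/O.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[url] = await fetch(url)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return results


def run_demo():
    # Placeholder URLs, not real IMDB paths.
    urls = [f"https://example.com/title/{i}" for i in range(20)]
    return asyncio.run(crawl(urls, concurrency=4))
```

Calling `run_demo()` returns a dict mapping each URL to its (fake) page. In a multi-process setup, each process would presumably run its own loop like this and pull jobs from a shared queue instead of a local list.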
I suppose IMDB already has a pretty good architecture to handle that load, but please, if you're crawling a single site, be careful. I host a similar database myself, and the CPU/load graphs of my server tell me exactly when someone has a crawler active again. That's not fun if your goal is to keep a site responsive while keeping hosting costs low.
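One simple way to be careful is to enforce a minimum gap between requests to the same host, no matter how many concurrent workers you run. A minimal sketch, again in Python asyncio; `HostThrottle`, `polite_demo`, and the 50 ms interval are hypothetical choices for illustration, not a recommendation for any particular site.

```python
import asyncio
import time


class HostThrottle:
    """Spaces out request start times for one host by at least `min_interval` seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._next_allowed = 0.0
        self._lock = asyncio.Lock()

    async def wait(self):
        # The lock serializes callers, so each one gets its own time slot.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
            # The next caller may start one interval after this caller's slot.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval


async def polite_demo(n=5, min_interval=0.05):
    """Run n concurrent jobs through one throttle and record when each one starts."""
    throttle = HostThrottle(min_interval)
    stamps = []

    async def job():
        await throttle.wait()
        stamps.append(time.monotonic())  # a real crawler would issue its request here

    await asyncio.gather(*(job() for _ in range(n)))
    return stamps
```

Every worker would call `await throttle.wait()` before each request to that host, so adding more workers or processes does not automatically mean more load on the target server (with multiple processes, the limit would have to be shared or divided between them).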