by rb2k_ 5653 days ago
Oh, node.js is definitely a great direction to go!

One of my problems was that a lot of the "usual" libraries are written in a synchronous/blocking manner behind the scenes. This is something that the node.js ecosystem would probably solve right from the start.

The downside of a relatively new library like httpClient is that it is missing things like automatically following redirects. While this can be implemented in the crawler code, it complicates things.
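For illustration, here is a rough sketch of what handling redirects in the crawler code could look like. It is not httpClient's API: `fetchOnce` is a hypothetical wrapper around whichever client is in use, and `fetchFollowing`, the redirect limit, and the loop check are assumptions made for the example.

```javascript
// Sketch: follow HTTP redirects on top of a client that does not do it itself.
// `fetchOnce(url, cb)` is a hypothetical wrapper around the crawler's HTTP
// client; it is assumed to call cb(err, { status, headers }).
const REDIRECT_CODES = new Set([301, 302, 303, 307, 308]);

function fetchFollowing(fetchOnce, startUrl, maxRedirects, done) {
  const seen = new Set(); // URLs already visited in this redirect chain

  function step(url, remaining) {
    if (seen.has(url)) return done(new Error("redirect loop at " + url));
    seen.add(url);

    fetchOnce(url, function (err, res) {
      if (err) return done(err);
      const location = res.headers && res.headers.location;
      // Not a redirect (or no target given): hand back the final response.
      if (!REDIRECT_CODES.has(res.status) || !location) {
        return done(null, { url: url, response: res });
      }
      if (remaining === 0) return done(new Error("too many redirects"));
      // Location may be relative, so resolve it against the current URL.
      step(new URL(location, url).href, remaining - 1);
    });
  }

  step(startUrl, maxRedirects);
}
```

Even in this small form it needs a hop limit, loop detection, and relative URL resolution, which is roughly the extra complexity mentioned above.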

How big are the datasets that Vertex.js/Tokyo Cabinet can handle for you?

Node.js is on the list of things I'd like to play with a bit more (just like Scala, Erlang, graph databases, mirah, ...). Is your crawler's source code available by any chance?


My dataset is still small, but you can scale a single TC db to nearly arbitrary size (8EB). It can also write millions of kv pairs / second.

Vertex.js can't quite keep up with TC as it's written in JavaScript. However, it does let you batch writes into logical transactions, which you can use to get fairly high throughput.
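To make the batching idea concrete, here is a minimal sketch of a client-side write buffer. It is not Vertex.js's actual API: `WriteBatcher`, `commit`, and `maxBatchSize` are hypothetical names, and `commit(ops)` stands in for whatever call sends one group of writes to the store as a single logical transaction.

```javascript
// Sketch of write batching; NOT Vertex.js's real interface.
// `commit(ops)` is a hypothetical function that submits one batch of
// operations to the store as a single logical transaction.
class WriteBatcher {
  constructor(commit, maxBatchSize) {
    this.commit = commit;
    this.maxBatchSize = maxBatchSize;
    this.pending = [];
  }

  // Queue a key/value write; flush automatically once the batch is full.
  put(key, value) {
    this.pending.push({ key: key, value: value });
    if (this.pending.length >= this.maxBatchSize) this.flush();
  }

  // Send everything queued so far as one batch.
  flush() {
    if (this.pending.length === 0) return;
    const ops = this.pending;
    this.pending = [];
    this.commit(ops);
  }
}
```

The throughput gain would come from paying the per-request overhead once per batch instead of once per write, at the cost of holding unflushed writes in memory until `flush()` runs.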

The source isn't open as it's fairly specific to my app, http://luciebot.com/. I'd be happy to chat about the details without releasing the source. richcollins@gmail.com / richcollins on freenode.

Be sure to check out this post: http://stackoverflow.com/questions/1051847/why-does-tokyo-ty...

I did some experimentation with tokyo* and experienced that slowdown myself. I just didn't want to disable journaling in the end...

Thanks -- I've seen that. I'm just going to make frequent backups and hope that lack of journaling doesn't bite me in the ass o_O