

by rb2k_ 5701 days ago

Oh, node.js is definitely a great direction to go!

One of my problems was that a lot of the "usual" libraries are written in a synchronous/blocking manner behind the scenes. This is something that the node.js ecosystem would probably solve right from the start.

The downside of a relatively new library like httpClient is that it lacks features such as automatically following redirects. While this can be implemented in the crawler code, it complicates things.
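To make the redirect point concrete, here is a minimal sketch of what following redirects in crawler code might look like. It is written in Python rather than node.js, and nothing in it is httpClient's API: `follow_redirects`, the `fetch` callback (assumed to return a status code and a dict of headers), and the `max_hops` limit are all hypothetical names chosen for illustration.

```python
from urllib.parse import urljoin

# Status codes that conventionally carry a Location header to follow.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(url, fetch, max_hops=5):
    """Follow redirects starting at `url`.

    `fetch` is a hypothetical callback: fetch(url) -> (status, headers_dict).
    Returns (final_url, status, hops_followed).
    """
    seen = set()
    for hops in range(max_hops + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_CODES or location is None:
            return url, status, hops
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects")


# Usage with a fake fetcher, so the sketch runs without network access.
pages = {
    "http://a.example/": (301, {"Location": "/b"}),
    "http://a.example/b": (302, {"Location": "http://b.example/c"}),
    "http://b.example/c": (200, {}),
}
result = follow_redirects("http://a.example/", lambda u: pages[u])
```

Even this small version needs a hop limit, loop detection, and relative URL resolution, which is the kind of extra complexity the crawler ends up carrying.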

How big are the datasets that vertex.js/tokyo cabinet is able to handle for you?

Node.js is on the list of things I'd like to play with a bit more (just like Scala, Erlang, graph databases, mirah, ...). Is your crawler's source code available by any chance?


richcollins 5700 days ago

My dataset is still small, but you can scale a single TC db to nearly arbitrary size (8EB). It can also write millions of kv pairs / second.

Vertex.js can't quite keep up with TC, as it's written in JavaScript. However, it does let you batch writes into logical transactions, which you can use to get fairly high throughput.
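As a rough illustration of why batching helps, here is a minimal Python sketch of a write buffer that groups many puts into one bulk write. This is not Vertex.js's or Tokyo Cabinet's API: `BatchingStore`, `batch_size`, and the plain dict standing in for the backend are all assumptions made for the example.

```python
class BatchingStore:
    """Hypothetical write buffer in front of a key-value backend.

    The backend here is a plain dict; a real store would replace
    `backend.update` with its own bulk or transactional write.
    """

    def __init__(self, backend, batch_size=1000):
        self.backend = backend
        self.batch_size = batch_size
        self.pending = {}
        self.commits = 0  # number of bulk writes issued so far

    def put(self, key, value):
        # Buffer the write; flush once the batch is full.
        self.pending[key] = value
        if len(self.pending) >= self.batch_size:
            self.commit()

    def commit(self):
        # Apply all buffered writes as one logical unit.
        if not self.pending:
            return
        self.backend.update(self.pending)
        self.pending = {}
        self.commits += 1


# Usage: seven writes with a batch size of three.
db = {}
store = BatchingStore(db, batch_size=3)
for i in range(7):
    store.put(f"key{i}", i)
store.commit()  # flush the final partial batch
```

The per-write overhead is paid once per batch instead of once per key, at the cost of losing any uncommitted batch if the process dies.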

The source isn't open, as it's fairly specific to my app, http://luciebot.com/. I'd be happy to chat about the details without releasing the source. richcollins@gmail.com / richcollins on freenode.


rb2k_ 5700 days ago

Be sure to check out this post: http://stackoverflow.com/questions/1051847/why-does-tokyo-ty...

I did some experimentation with tokyo* and experienced that slowdown myself. I just didn't want to disable journaling in the end...


richcollins 5699 days ago

Thanks -- I've seen that. I'm just going to make frequent backups and hope that lack of journaling doesn't bite me in the ass o_O
