by MichaelSalib 4819 days ago
ZooKeeper is a distributed coordination service. Think of it as an extremely robust, reliable datastore for handling small amounts of data. It provides that robustness by using an expensive synchronization protocol. When you try to slam it with large volumes of data, ZooKeeper falls over. And Storm relies on ZooKeeper for basic functioning, so without a running ZooKeeper ensemble, the associated Storm cluster will die too.

I wish it were a bit more robust than it is. The ZooKeeper version we run (3.3.4, admittedly not the newest) reports the wrong version number (3.3.3) and has a major bug in the way it does snapshots. We found that it doesn't serialize the tree of nodes to disk correctly: there is a race condition where it writes a node even though the parent of that node has been deleted. ZK then tries to reload from the flawed snapshot and cannot, so it crashes, which results in endless leader elections that never resolve.
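
To make the failure mode concrete, here is a rough sketch of the invariant the snapshot violated: every node's parent must also be present. It works on a plain list of paths, which is purely an assumption for illustration. It does not read ZooKeeper's actual snapshot format, and `find_orphans` is a hypothetical helper, not part of any ZooKeeper tooling.

```python
def find_orphans(paths):
    """Return the paths whose parent path is missing from the set.

    `paths` is a hypothetical flat list of node paths, standing in for
    whatever a snapshot contains. The root "/" has no parent to check.
    """
    present = set(paths)
    orphans = []
    for path in sorted(present):
        if path == "/":
            continue
        # "/storm" -> parent "/", "/storm/errors" -> parent "/storm"
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in present:
            orphans.append(path)
    return orphans


# "/jobs" was deleted, but its child still made it into the snapshot.
snapshot_paths = ["/", "/storm", "/storm/errors", "/jobs/task-1"]
print(find_orphans(snapshot_paths))  # ['/jobs/task-1']
```

A loader that rebuilds the tree by attaching each node to its parent has nowhere to put `/jobs/task-1`, which is the kind of snapshot that cannot be reloaded.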

All software has bugs and these specific problems have been fixed in newer versions, but they are super scary issues to run into with your distributed coordination service.

That makes sense. It's not clear to me though why error logging should belong to it.
Well, a Storm "program" operates concurrently on many nodes at once. If an exception is thrown, you may want to log it and the stack trace, but where? If you write to a local log file, that data will be useless unless you run some sort of log shipping or log centralization (like with Scribe, Kafka, or syslog-ng). But that's usually a pain in the neck to set up, and you can't run Storm without already running a ZooKeeper cluster, so if you're lazy, you just log to ZooKeeper.

Everything is fine as long as exceptions are infrequent.
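
One way to keep exceptions "infrequent" from the coordination service's point of view is to throttle error reports before they are written. The sketch below is only an illustration of that idea, not how Storm does it: `ThrottledErrorReporter` and its parameters are made-up names, and `sink` is a stand-in for whatever function actually performs the write (no ZooKeeper client calls appear here).

```python
import time


class ThrottledErrorReporter:
    """Forward at most `max_per_window` messages per time window to `sink`.

    Hypothetical helper: `sink` is any callable taking one message; the
    clock is injectable so the behaviour can be tested without sleeping.
    """

    def __init__(self, sink, max_per_window=5, window_seconds=60.0,
                 clock=time.monotonic):
        self._sink = sink
        self._max = max_per_window
        self._window = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._sent = 0
        self.dropped = 0  # messages discarded because the window was full

    def report(self, message):
        """Send the message if the current window has room; else drop it."""
        now = self._clock()
        if now - self._window_start >= self._window:
            # Start a fresh window and reset the budget.
            self._window_start = now
            self._sent = 0
        if self._sent < self._max:
            self._sent += 1
            self._sink(message)
            return True
        self.dropped += 1
        return False


# Usage: collect into a list instead of a real coordination service.
received = []
reporter = ThrottledErrorReporter(received.append, max_per_window=2)
for i in range(5):
    reporter.report(f"exception {i}")
print(len(received), reporter.dropped)  # 2 3
```

The trade-off is that a burst of exceptions loses detail, so the `dropped` counter is worth surfacing somewhere cheap.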

Indeed.

All our error logging is kept entirely away from the live kit to prevent shit like this happening.