by abecedarius 4819 days ago
One question this raised (and I don't mean this as a gotcha): why could a flood to the error-reporting servers take down all of the applications? I expected the primary fix to be to decouple the work so it could continue with no error reporting server. (But I'm not familiar with Zookeeper or any of the other work the author's doing, beyond reading some post on Storm.)
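To make the decoupling idea concrete, here is a minimal sketch, not anything taken from Storm. The class name `BestEffortReporter`, the `max_pending` parameter, and the `send` callback are all hypothetical. Application code hands errors to a bounded in-memory buffer and never blocks or raises; a background worker ships them when the reporting server is reachable and discards them when it isn't.

```python
import queue


class BestEffortReporter:
    """Hypothetical error reporter that never stalls the caller."""

    def __init__(self, max_pending=100):
        # Bounded buffer: a flood of errors cannot grow memory without limit.
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def report(self, message):
        """Queue a message; drop it (and count the drop) if the buffer is full."""
        try:
            self.pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, send):
        """Ship buffered messages with `send`; a failing send only loses the message."""
        sent = 0
        while True:
            try:
                message = self.pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                send(message)
                sent += 1
            except Exception:
                # The reporting server being down must not affect the application.
                self.dropped += 1
```

The trade-off is that errors can be lost under load, which is usually acceptable for a "recent errors" view and not acceptable for an audit log.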

It's a convenience thing so that users can quickly see if there are any errors happening in their applications. While you could provide hooks to integrate the error stream with some external error reporting system, you also want something that just works out of the box. Zookeeper is the only place that Storm can store state, Zookeeper is good at storing small amounts of data, and the recent errors are a small amount of data (as long as things are properly throttled). Hence, the design.
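For what "properly throttled" could look like, here is a rough sketch of a fixed-window throttle. This is an illustration under assumptions, not Storm's actual implementation. `ErrorThrottle`, `report_error`, and the list standing in for the Zookeeper write are made-up names.

```python
import time


class ErrorThrottle:
    """Allow at most max_reports reports per window_seconds (fixed window)."""

    def __init__(self, max_reports, window_seconds, clock=time.monotonic):
        self.max_reports = max_reports
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.window_start = clock()
        self.count = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.max_reports:
            self.count += 1
            return True
        return False


def report_error(throttle, store, message):
    """Append message to store (a stand-in for the real write) if the throttle permits."""
    if throttle.allow():
        store.append(message)
        return True
    return False
```

With a cap like this, a task throwing thousands of exceptions per second still produces only a handful of writes per window, so the amount of data stays small.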
Thanks -- it's interesting to hear how people do this kind of work that I'm not involved in these days.
Zookeeper is a distributed coordination service. Think of it as an extremely robust, reliable datastore for handling small amounts of data. It provides that robustness by using an expensive synchronization protocol. When you try to slam it with large volumes of data, Zookeeper falls over. And Storm relies on Zookeeper for basic functioning, so without a running Zookeeper ensemble, the associated Storm cluster will die too.
I wish it were a bit more robust than it is. The ZooKeeper version we run (3.3.4, admittedly not the newest) reports the wrong version number (3.3.3) and has a major bug in the way it does snapshots. We found that it doesn't serialize the tree of nodes to disk correctly: there is a race condition where it writes a node even though the parent of that node has been deleted. ZK then tries to reload from the flawed snapshot, cannot, and crashes, which results in endless leader elections that never resolve.

All software has bugs and these specific problems have been fixed in newer versions, but they are super scary issues to run into with your distributed coordination service.
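A toy model of that failure mode, for anyone who wants to see why an orphaned node is fatal on reload. This is not ZooKeeper's snapshot format or code; `load_snapshot` and the list-of-paths representation are invented purely for illustration.

```python
def load_snapshot(paths):
    """Rebuild a set of node paths in order, requiring each parent to exist first."""
    tree = {"/"}
    for path in paths:
        # "/a/b" -> parent "/a"; "/a" -> parent "/" (the root).
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in tree:
            raise ValueError(f"orphan node {path}: parent {parent} missing")
        tree.add(path)
    return tree
```

If the parent is deleted between the moment the writer lists the nodes and the moment it writes the child, the snapshot ends up looking like `["/a/b"]` with no `"/a"`, and a loader with this kind of check cannot recover on its own.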

That makes sense. It's not clear to me though why error logging should belong to it.
Well, a Storm "program" operates concurrently on many nodes at once. If an exception is thrown, you may want to log it and the stack trace, but where? If you write to a local log file, that data will be useless unless you run some sort of log shipping or log centralization (like with Scribe, Kafka, or syslog-ng). But that's usually a pain in the neck to set up, and you can't run Storm without already running a Zookeeper cluster, so if you're lazy, you just log to Zookeeper.

Everything is fine as long as exceptions are infrequent.

Indeed.

All our error logging is entirely away from the live kit to prevent shit like this happening.