Hacker News new | ask | show | jobs
by mikebabineau 4593 days ago
We're running 0.7 and most of our problems have been around partition rebalancing. I'm not the primary engineer on this, but here's my understanding:

If we add nodes to an existing Kafka cluster, those nodes own no partitions and therefore send/receive no traffic. A rebalancing event must occur for these servers to become active. Bouncing Kafka on one of the active nodes is one way to trigger such an event.

Fortunately, cluster resizing is infrequent. Unfortunately, network interruptions are not (at least on EC2).

When ZooKeeper detects a node failure (however brief), the node is removed from the active pool and the partitions are rebalanced. This is desirable. But when the node comes back online, no rebalancing takes place. The server remains inactive (as if it were a new node) until we trigger a rebalancing event.

As a result, we have to bounce Kafka on an active server every few weeks in response to network blips. 0.8 alleges to handle this better, but we'll see.

Handle-jiggling aside, I'm a fan of Kafka and the types of systems you can build around it. Happy to put you in touch with our Kafka guy, just email me (mike.babineau@rumblegames.com). Loggly's also running Kafka on AWS - would be interesting to hear their take on this.

2 comments

(I work on the software infrastructure team at Loggly -- philip at loggly dot com)

We at Loggly are pretty excited about Kinesis -- I was at the announcement yesterday. We're already big users of Kafka, and we see Kinesis as another potential source and sink for our customers' log data. Our intent is to support pulling streams of log data in real time from our customers' Kinesis infrastructure, if they are already stream data to there -- and then analyze and index it.

And ss a sink, where we can stream customers' log data to their Kinesis infrastructure, after we've analyzed and indexed it for them, just like we do today with AWS S3. It could work really, really well.

Pushing a couple terabytes a day through kafka 0.7. We don't use zookeeper on the producing side and it alleviates this a lot. It's a little more brittle pushing host/partition configs around, but we accepted loss of data in this system and its worth the simplicity of it. Also played with the idea of putting an elb in front.

I'm having way more trouble with the consumer being dumb with the way it distributes topics and partitions. End up with lots of idle consumers, while others are way above max.

Thanks for the note, we'll have to take a look at that sort of configuration.

Your consumer problems sounds similar to one we had. Root cause was that the number of consumers exceeded the number of active partitions. The tricky part was that the topic was only distributed across part of the cluster (because of the issue described in my parent post), so we had fewer partitions than we thought.