|
|
|
|
|
by thomaslee
3566 days ago
|
|
I used to be on a team responsible for a single small-ish Kafka cluster (between 6-12 nodes) doing non-trivial throughput on bare metal. Without commenting on whether ZeroMQ is the right alternative: I can understand being scared off. Our hand was forced such that we had to go the other way and understand what was going on in Kafka. The kicker is that Kafka can be rock solid in terms of handling massive throughput and reliability when the wheels are well greased, but there are a lot of largely undocumented lessons to learn along the way RE: configuration and certain surprising behavior that can arise at scale (such as https://issues.apache.org/jira/browse/KAFKA-2063, which our team ran into maybe a year ago & is only being fixed now). Symptoms of these issues can cause additional knock-on effects with respect to things like leader election (we wound up with a "zombie leader" in our cluster that caused all sorts of bizarre problems) and graceful shutdowns. Add to that the fact the software is still very much under active development (sporadic partition replica drops after an upgrade from 0.8.1 to 0.8.2; we had to apply some small but crucial patches from Uber's fork) & that it needs a certain level of operational maturity to monitor it all ... it's easy to get nervous about what the next "surprise" will be. Having said all that, I'd use Kafka again in a heartbeat for those high volume use cases where reliability matters. Not sure I'd advise others without similar operational experience to do the same for anything mission critical, though -- unless you like stress. That stress is why Confluent is in business. :) |
|
Kafka and Confluent Platform are very much still works in progress. I had to patch Kafka Connect HDFS connector because a fix I needed was left out of the last release. Be prepared to do something similar with any of Kafka's components.