by EdwardDiego 2223 days ago
Yes, but this is true of any system offering N - 1 safety, e.g., HDFS, Vertica, Pulsar. It's not specific to Kafka.

And you can switch to your warm replicated cluster in this scenario, if you have one. MirrorMaker 2 supports replicated offsets, so consumers can switch without losing state.

But what you're describing is going to shaft any replicated system.


Not true for HDFS, Cassandra, Pulsar, and most distributed file systems.

As soon as a segment is under-replicated, its replication factor is restored in under 2 minutes by selecting a new machine as a replica.

Kafka tries to do this with “Kafka Cruise Control”, but adding a replica to the in-sync replica list takes several hours if partitions are 300GB and servers are already busy handling regular live traffic.
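
For a rough sense of scale, here is a back-of-the-envelope sketch of how long one 300GB replica takes to copy at a given sustained rate. The rates are hypothetical, not measurements from any cluster in this thread, and the function name is made up for illustration. It assumes decimal units and ignores new writes arriving during the copy.

```python
def replica_catchup_hours(partition_gb: float, effective_mb_per_s: float) -> float:
    """Hours to copy one partition replica at a sustained effective rate.

    Uses decimal units (1 GB = 1000 MB) and ignores new writes arriving
    while the copy runs, so the result is a lower bound.
    """
    if effective_mb_per_s <= 0:
        raise ValueError("effective_mb_per_s must be positive")
    seconds = partition_gb * 1000 / effective_mb_per_s
    return seconds / 3600


# Hypothetical effective replication rates in MB/s (assumptions, not data).
for rate in (25, 100, 400):
    print(f"{rate} MB/s -> {replica_catchup_hours(300, rate):.2f} h")
```

Whether "several hours" is plausible therefore comes down to how much bandwidth the replica fetchers actually get once live traffic and any throttle are accounted for.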

> adding a replica to the in-sync replica list takes several hours if partitions are 300GB

I'd be curious to hear more about this, because I run several topics with similar partition sizes and haven't seen a single replica take several hours. I've routinely moved 350GB partition replicas as part of regular maintenance.

I have seen it take 2 hours to restore a broker that was shut down improperly, but yeah, assuming your replica fetchers aren't throttled to shit and your brokers aren't overloaded (what's the request handler avg idle? 20% or lower is time to add another broker, 10% is time to add another broker right now), that's really extreme.
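
The rule of thumb above can be written down as a tiny check. This is only a sketch of the thresholds as stated in this comment: the function name is hypothetical, and it assumes you feed it the request handler average idle as a percentage from 0 to 100 (check how your monitoring reports the metric, since some setups expose it as a 0 to 1 fraction).

```python
def broker_capacity_advice(idle_percent: float) -> str:
    """Map request handler average idle (0-100) to the rule of thumb above."""
    if not 0 <= idle_percent <= 100:
        raise ValueError("idle_percent must be between 0 and 100")
    if idle_percent <= 10:
        return "add another broker right now"
    if idle_percent <= 20:
        return "time to add another broker"
    return "ok"


# Hypothetical readings, for illustration only.
for reading in (45.0, 18.0, 7.5):
    print(reading, "->", broker_capacity_advice(reading))
```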