by skyde
2227 days ago
Not true for HDFS, Cassandra, Pulsar, and most distributed file systems. As soon as a segment is under-replicated, its replication factor is restored in less than 2 minutes by selecting a new machine as a replica. Kafka tries to do this with "Kafka Cruise Control", but adding a replica to the in-sync replica list takes several hours if partitions are 300GB and the servers are already busy handling regular live traffic.
I'd be curious to hear more about this, because I run several topics with similar partition sizes, and haven't encountered several hours for one replica, and I've routinely shifted 350GB partition replicas as part of routine maintenance.
I have encountered 2 hours to restore a broker that was shut down improperly, but yeah, assuming your replica fetchers aren't throttled to shit, or your brokers aren't overloaded (what's the request handler avg idle? 20% or lower is time to add another broker, 10% is time to add another broker right now), that's really extreme.
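
To put rough numbers on this, here's a back-of-envelope sketch. The function names are made up for illustration, the fetch rates are arbitrary examples rather than measurements from any cluster, and the idle thresholds are just the rule of thumb above, with idle expressed as a 0.0-1.0 ratio.

```python
def replica_copy_hours(partition_gb: float, fetch_mb_per_s: float) -> float:
    """Hours to copy one partition replica at a sustained fetch rate.

    Assumes 1 GB = 1000 MB and a constant rate; a real broker competing
    with live traffic will not sustain a constant rate.
    """
    return partition_gb * 1000 / fetch_mb_per_s / 3600


def broker_headroom(request_handler_avg_idle: float) -> str:
    """Map the request handler average idle ratio to the rule of thumb above."""
    if request_handler_avg_idle <= 0.10:
        return "add a broker right now"
    if request_handler_avg_idle <= 0.20:
        return "time to add a broker"
    return "ok"


if __name__ == "__main__":
    # Illustrative sustained fetch rates in MB/s (assumed, not measured).
    for rate in (10, 50, 200):
        hours = replica_copy_hours(300, rate)
        print(f"300 GB at {rate} MB/s: {hours:.1f} h")
    print(broker_headroom(0.15))
```

The point being that "several hours" for a 300GB replica implies a sustained fetch rate in the low tens of MB/s, which is what you'd expect from a tight replication throttle or an overloaded broker rather than from partition size alone.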