by ahachete 3944 days ago
Yes, I got that; maybe I wasn't clear enough. My main concern is that slow nodes (rather than failing nodes) can easily cause latency spikes, and that seems to me a fairly common situation. The nice thing about quorum writes is that outliers are ignored, but with the ISR you need to wait for every replica in the set, so outliers are not ignored (until, maybe, they are removed from the ISR). I understand the advantages and the trade-off here, but I would like to see whether it is a good trade-off, as outliers may have a big impact.
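To make the concern concrete, here is a minimal sketch of the two acknowledgement rules. It is a toy model, not Kafka's actual replication protocol: it assumes 3 replicas that respond independently, each taking 1 ms normally and 100 ms with probability 1% (all of these numbers, and the function names, are made up for illustration).

```python
import random

def ack_latency(replica_latencies, acks_needed):
    """Latency of a write that completes once `acks_needed` replicas have responded."""
    return sorted(replica_latencies)[acks_needed - 1]

def simulate(trials=10_000, replicas=3, slow_prob=0.01, seed=0):
    """Return per-write latencies for 'wait for all' and 'wait for a majority'.

    Assumption: each replica is independently slow (100 ms) with
    probability `slow_prob`, and fast (1 ms) otherwise.
    """
    rng = random.Random(seed)
    majority = replicas // 2 + 1
    wait_all, wait_quorum = [], []
    for _ in range(trials):
        lat = [100.0 if rng.random() < slow_prob else 1.0 for _ in range(replicas)]
        wait_all.append(ack_latency(lat, replicas))      # ISR-like: slowest replica decides
        wait_quorum.append(ack_latency(lat, majority))   # quorum: the outlier is ignored
    return wait_all, wait_quorum

def fraction_slow(latencies, threshold=50.0):
    """Fraction of writes whose latency exceeds `threshold` ms."""
    return sum(l > threshold for l in latencies) / len(latencies)

if __name__ == "__main__":
    wait_all, wait_quorum = simulate()
    print("slow writes, wait for all:     ", fraction_slow(wait_all))
    print("slow writes, wait for majority:", fraction_slow(wait_quorum))
```

Under these assumptions, a wait-for-all write is slow whenever any replica is slow (probability 1 - 0.99^3, about 3%), while a majority write is slow only when at least two of the three replicas are slow (about 0.03%). Real deployments differ, since slowness is often correlated across replicas and a persistently slow replica eventually drops out of the ISR.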

Got it. Yes, quorum systems have a higher tolerance to tail latency, no question about it. We do mention it briefly in the post, but we don't have numbers to show. I'm not aware of it being a major concern for Kafka deployments, but I can say that for Apache BookKeeper we ended up adding the notion of ack quorums to get around such latency spikes. I'll see if I can gather some more information about Kafka that we can share at a later time. Thanks for raising this point.
It would be awesome to have some numbers on this topic. Thanks for your interest. I guess that in other systems, such as Paxos, this could be solved by separating the roles of learners and acceptors, which are usually collapsed onto the same nodes. In that case you could have more learners than acceptors (avoiding the N^2 growth in communication with the number of nodes) while still addressing tail latency by running the quorum among the acceptors only.
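As a rough back-of-the-envelope sketch of that last point: if every acceptor notifies every learner of an accepted value, the number of notifications is the product of the two counts. The function names and node counts below are made up for illustration, and real Paxos implementations often reduce this further (for example with a distinguished learner).

```python
def majority(acceptors):
    """Smallest number of acceptors that forms a majority quorum."""
    return acceptors // 2 + 1

def accepted_notifications(acceptors, learners):
    """Messages sent when every acceptor notifies every learner of an accepted value."""
    return acceptors * learners

if __name__ == "__main__":
    n = 9
    # Roles collapsed: all 9 nodes are both acceptors and learners.
    print("collapsed:", accepted_notifications(n, n), "messages, quorum of", majority(n))
    # Roles separated: 3 acceptors, still 9 learners.
    print("separated:", accepted_notifications(3, n), "messages, quorum of", majority(3))
```

With the roles collapsed, the count grows as N^2; with a small fixed set of acceptors it grows linearly in the number of learners, and the write still only waits for a majority of the acceptors.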