Hacker News new | ask | show | jobs
by ahachete 3948 days ago
Very insightful post. My main concern with this approach is that having the ISR a fixed size (at any given time), as the post explains, it is subject to latency spikes if a node of the ISR is slow. This seems to me it would happen a lot in real cases, and its a major drawback.

So while providing a lot of flexibility and decreasing the communication overhead when using a lot of nodes, it may introduce performance degradation and latency spikes when nodes are slow. I'd probably prefer a design where some kind of reduced majority is required, but not a full list of nodes, so that outliers would be filtered out.

Are there good percentile numbers over there to check metadata writes over the, say, 99,99%-99,999% percentile, to check how bad this may become?

1 comments

I'm happy you enjoyed the post and thanks for the comment.

If I understand your comment correctly, it isn't the case that the ISR is fixed. The ISR has a minimum size, but it can change over time, so brokers can be removed from the ISR and they can rejoin later once they catch up.

Yes, I got that, maybe I wasn't clear enough. My main concern is that slow nodes (rather than failing nodes) may easily provoke latency spikes, and that seems to me a quite frequent situation. The good point about quorum writes is that outliers are ignored, but with the ISR, as you need to wait for all of them, outliers are not ignored (until, maybe, removed from the ISR). I understand the advantages and the compromise here. But I would like to see if this is a good compromise, as outliers may have a big impact.
Got it, yeah, quorum systems have higher tolerance to tail latency, there is no question about it. we do mention it briefly in the post, but we don't have numbers to show. I'm not aware of it being a major concern for kafka deployments, but I can say that for Apache BookKeeper, we ended up adding the notion of ack quorums to get around such latency spikes. I'll see if I can gather some more information about kafka that we can share at a later time. Thanks for raising this point.
That would be awesome to have some numbers about this topic. Thanks for your interest. I guess with other systems like Paxos it could be solved by separating the notion of learners and acceptors, which are usually collapsed in the same nodes. In this case, you may have more learners than acceptors (solving the N^2 communication growth with the number of nodes) while still solving the tail latency by running quorum among the acceptors.
I think a further interesting observation is that one can actually trade time uncertainty (i.e. latency spikes) for location uncertainty.

One can drive latency almost arbitrarily low if one is willing to give up exact knowledge of a where the data is.

A simple example is N out of M writes (N and M can be arbitrary with N <= M). M is a set of machines, N is number of replicas we want to have. Now at any given time we write to the N machines that respond fastest. Data is now arbitrarily sprayed over the M machines, but as long as the data itself has ordering information the right state can be reassembled by talking to the M machines.

The "spray" can now be controlled by favouring some nodes from M up to a timeout (i.e. we put more uncertainty in time). We can reduce the reassembly work by using learners and hence increase the likelihood that one machine has all state.

yeah, this is precisely what we do in Apache BookKeeper, and we map N to ack quorums and M to write quorums.