Hacker News new | ask | show | jobs
by jvanlightly 1135 days ago
It's a common misconception about Kafka and fsyncs. But the Kafka replication protocol has a recovery mechanism, much in the same way that Viewstamped Replication Revisited does (except it's safer due to the page cache), which allows Kafka to write to disk asynchronously. The trade-off is that we need fault domains (AZs in the cloud), but if we care about durability and availability, we should be deploying across AZs anyway. We've seen plenty of full region outages, but zero power loss events in multiple AZs in six years.

Kafka and fsyncs: https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...

3 comments

As far as read the blog post, I understand that it assumes the scenario that "a replica dies (and loses its log prefix due to no fcyns) and came back instantaneously (before another replica catches up to the leader)".

Then, in Kafka, what if the leader dies with power failure and came back instantaneously?

i.e.: Let's say there are 3 replicas A(L), B(F), C(F) (L = leader, F = follower)

- 1) append a message to A

- 2) B, C replicas the message. The message is committed

- 3) A dies and came back instantaneously before zk.session.timeout elapsed (i.e. no leadership failover happens), with losing its log prefix due to no fsync

Then B, C truncates the log and the committed message could be lost? Or is there any additional safety mechanism for this scenario?

I love this question. Would be great to hear back from Confluent about this.

One safety mechanism I can think of is that the replicas will detect the leader is down and trigger leader election themselves. Or that upon restart the leader realized it restarted and triggers leader election in a way that B ends up as the leader. (not sure either is being done)

As I think about it more, even if there’s a solution I think I’ll stick to running Redpanda or running Kafka with fsync.

The solution seems to be fsync. It’s what it’s for. It’s very appealing to wave it away because it’s expensive.

The situation above may be just one example of data loss, but it seems there could be others when we gamble on hoping servers restart quickly enough, and don’t crash at the same time, etc.

Two comments here.

1) What about Kafka + KRaft, doesn't that suffer the same problem you point out in Redpanda? If so, recommending to your customers to run KRaft without fsync would be like recommending running with a Zookeeper that sometimes doesn't work. Or do I fundamentally misunderstand KRaft?

2) You mention simultaneous power failures deep into the fsync blog post. I think this should be more visible in your benchmark blog post, when you write about turnning off fsyncs.

1) KRaft is only for metadata replication, and data replication is done in ISR based even in KRaft, so it doesn't change the conclusion
Nah, I dug deeper into this. Right conclusion, wrong reasoning.

The reason KRaft turns out to be fine is because the KRaft topic does fsync! Source: https://cwiki.apache.org/confluence/display/KAFKA/KIP-595%3A...

KRaft is used for metadata replication in the same way that Zookeeper is used for metadata. I.e., in a very meaningful way.

repeating things does not make them true. I read the post. You can only control some failures, but happy for us to write our thoughts in blog form.