Hacker News new | ask | show | jobs
by globalreset 1135 days ago
> Issue #1 is that in Kafka’s server.properties file has the line log.flush.interval.messages=1 which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

And then in this article it's explained how Kafka is actually unsafe:

> Kafka may handle simultaneous broker crashes but simultaneous power failure is a problem.

just against simultaneous node crashes (whole VM/machine).

I mean - sure in practice running in different AZs, etc. will probably be good enough, but technically...

3 comments

You can't eliminate the risk of data loss, only control for it. fsync is one such control. Empirically, having separate power failure domains strongly controls for the power loss risk.

In the tail there are all kinds of things that will lose you data. I've actually seen systems lose data with the fsync every message strategy on simultaneous power loss. There was latent corruption of the filesystem due to a kernel bug. After power cycling a majority of nodes had unrecoverable filesystems.

In my experience, even on modern flash the cost of fsync is non trivial. It pessimizes io. You can try to account for this with group commit / batching but but generally the batch window needs to be large relative to network rtt to be effective.

fsync is much more necessary on single primary systems.

I only remember losing one etcd cluster, and it was due to something along these lines. Data center at the customer site lost power, and we were called when they couldn't recover our software. All the etcd volumes were corrupted, and after volume recovery by the customer IT department, we found all our etcd files corrupted.

My best guess is their volume systems simply lied about the fsync, which I've heard of a few times about different vendors.

If your workload demands it, then by all means set log.flush.interval.messages=1 or find an alternative solution that is a better match for your requirements.

Kafka has never pretended that ack'd messages have been persisted to disk, only that they've been replicated per your requested acks.

Yep kafka by default is setup to lose data, many people dont know or dont care it seems…
Well, that just isn't accurate really. Kafka would need simulteanous VM failure across all AZs. That just doesn't happen in the real world often enough to worry about. It has never happened in Confluent Cloud. RP have a similar issue. Single AZ deployments with local NVMe drives. AZ loses power, a majority of brokers could lose all their data. Then there's data corruption. Fsyncs alone don't save you. The next step would be to implement Protocol Aware Recovery (https://www.usenix.org/conference/fast18/presentation/alagap...) like TigerBeetle have. Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc because they didn't implement that?
I vouched for this comment (can we please not, folks?). Sure but many people dont run across AZ bc it costs a ton of money. Fsync alone dont save you but it sure makes it less likely to suffer data loss.

> Does a system that has implemented anti-corruption in the storage layer now get to lambast Redpanda, Pulsar, ZooKeeper etc because they didn't implement that?

Sure, why not? I think zk doesn’t do fsync too btw

My gut feeling is that if your only AZ goes down (or all your AZs simultaneously), you're going to lose data period because your producers are now all stuck, your APIs are unavailable, etc. Whether the data loss begins at the exact moment power failed or a couple minutes before doesn't matter, vs. the additional cost to fsync constantly.

I mean it's good to know all the failure modes, but at the end of the day it's also good to know how much handling them will cost, and it's often not worth it.

This is very practical way of looking at the problem and is true for majority of systems, but anyone serious enough about keeping their data, and not just pretending, has some kind of back pressure mechanism built in, so the messages will stop flowing if they can't be processed.
Right, and best case that’s going to come back as 503s or 429s, and if that continues for any length of time your customers are going to view it as morally equivalent to data loss (or maybe worse, if the response has no reason for them to be tied to some event stream).
producers stuck != data loss (if you use transactional commits at least). If you run in multiple regions you dont need multi az in a lot of architectures
I don't mean because of some misfeature in the Kafka protocol, I mean because events are still coming in but have nowhere to go. Unless you built a spill as wide as your Kafka cluster. Which isn't worth it, so no one does it.
Who is running single az deployments who also cares about data loss and availability? Seriously? I’ve personally supported 1000s of kafka deploys and this isn’t a thing in the cloud at least. There is no call for wanting fsync per message, it is an anti pattern and isn’t done because it isn’t necessary. Data loss in kafka isn't a real problem that hurts real world users at all.
I was grabbing beer with a buddy who has ran some large - petabytes per month - Kafka deployments, and his experience was very much that Kafka will lose acked writes if not very carefully configured. He had direct experiences with data loss from JVM GC creating terrible flaky cluster conditions and, separately, from running out of disk on one cluster machine
> There is no call for wanting fsync per message, it is an anti pattern and isn’t done because it isn’t necessary

1. Don't have to do it by message

2. It's used by many distributed db engines, kafka and (i think) zk are the outliers here, not the other way around

Kafka is not a "db engine". zk is a "db engine" in the same way 'DNS' is a "db engine".
I can't list names about the "unserious" people who aren't running multi-AZ, but this is the approach to durability that MongoDB took ~15 years ago and they have never lived it down.

It may just be that data reliability isn't a huge concern for messaging queues, so it's less of an issue, but pretending the risk isn't there doesn't help anyone.

Zookeeper absolutely does fsync, you can't disable it (without libeatmydata). It will log a warning if fsync is too slow as well.
Exactly because we read the documentation and we use it for things where losing data is acceptable.

Just like using HyperLogLog acceptable in many scenarios, using Kafka also acceptable. I am quite baffled how widespread the misuse of technology.

Need reliable data storage? Use a database.

the opposite should be true tho. opt-in for unsafe. you are the minority if you read the docs, let's be real :) most ppl never read the full docs. of the ppl i chat w/ is more like 5%