by globalreset
1135 days ago
> Issue #1 is that in Kafka’s server.properties file has the line log.flush.interval.messages=1 which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

And then in this article it's explained how Kafka is actually unsafe:

> Kafka may handle simultaneous broker crashes but simultaneous power failure is a problem.

just against simultaneous node crashes (whole VM/machine). I mean, sure, in practice running in different AZs, etc. will probably be good enough, but technically...
In the tail, there are all kinds of things that will lose you data. I've actually seen systems lose data on simultaneous power loss even with the fsync-every-message strategy. There was latent filesystem corruption due to a kernel bug, and after power cycling, a majority of nodes had unrecoverable filesystems.
In my experience, even on modern flash, the cost of fsync is nontrivial. It pessimizes I/O. You can try to account for this with group commit / batching, but generally the batch window needs to be large relative to network RTT to be effective.
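To make the group commit idea concrete, here is a minimal sketch in Python. It is not how Kafka or any particular system implements it: `GroupCommitLog`, `write_messages`, and the count-based `batch_size` are hypothetical names made up for illustration, and a real implementation would usually also bound the batch by a time window. The point is only that one fsync covers many appended messages.

```python
import os
import tempfile


class GroupCommitLog:
    """Hypothetical append-only log that fsyncs once per batch, not per message."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size  # assumption: batches are bounded by count only
        self.pending = 0              # messages written but not yet fsynced
        self.fsync_calls = 0          # counter kept only to show the effect

    def append(self, payload):
        # The write lands in the page cache; it is not durable until flush().
        os.write(self.fd, payload + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        # A single fsync makes every pending message in the batch durable.
        if self.pending:
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.pending = 0

    def close(self):
        self.flush()
        os.close(self.fd)


def write_messages(n, batch_size):
    """Append n messages to a temporary log and return how many fsyncs were issued."""
    with tempfile.TemporaryDirectory() as d:
        log = GroupCommitLog(os.path.join(d, "log"), batch_size)
        for i in range(n):
            log.append(b"msg-%d" % i)
        log.close()
        return log.fsync_calls
```

With `batch_size=1` this degenerates to fsync on every message. With a larger batch, the fsync count drops, but a message cannot be acknowledged as durable until its batch is flushed, which is where the trade-off against latency comes from.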
fsync is much more necessary on single primary systems.