by otterley
1490 days ago
I’m a little confused. If you don’t ensure data is committed to storage (log or otherwise) before acking the write request, how can you call it durable? If it’s not truly 100% durable by default, it’s best not to suggest that it is. Experience says people will use the default settings and then become very cross if they lose data. It undermines trust and is harmful to reputation.
If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write and sync it: that forces a full 4KiB page write, which is roughly 40x write amplification. Suddenly a 2GiB/sec SSD can effectively only write about 50MiB/sec of application data. Similarly, the per-write latency goes from <5µs to 10µs (with the fastest Optane SSDs) or 150µs (with flash SSDs).
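To make the arithmetic above concrete, here is a small back-of-the-envelope sketch. The constants are just the figures from this comment (100B writes, 4KiB pages, a 2GiB/sec device), and the helper name `write_amplification` is made up for illustration. It assumes every synced write rounds up to whole pages and ignores any other overhead.

```python
PAGE_SIZE = 4096              # 4 KiB page, as in the example above
WRITE_SIZE = 100              # 100 B application write
DEVICE_MIB_PER_S = 2 * 1024   # 2 GiB/sec device, expressed in MiB/sec


def write_amplification(write_size, page_size=PAGE_SIZE):
    """Bytes physically written per byte the application asked to write,
    assuming each synced write is rounded up to whole pages."""
    pages = -(-write_size // page_size)  # ceiling division
    return pages * page_size / write_size


amp = write_amplification(WRITE_SIZE)
effective_mib_per_s = DEVICE_MIB_PER_S / amp

print(f"write amplification: {amp:.2f}x")
print(f"effective throughput: {effective_mib_per_s:.0f} MiB/sec")
```

The exact ratio is 4096/100, so "40x" is a rounding of about 41x, and dividing the device bandwidth by that ratio gives the 50MiB/sec figure.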
So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.
RocksDB also offers a fairly weak guarantee by default: it writes to the write-ahead log (WAL) but does not fsync on each write (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).
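For anyone unfamiliar with the trade-off, here is a minimal sketch of the two ends of the range using plain file I/O. This is not RocksDB's API or the original system's implementation. The function `append_records` and its `sync_each` flag are hypothetical names for illustration; the only real calls are Python's `open`, `flush`, and `os.fsync`.

```python
import os
import tempfile


def append_records(path, records, sync_each):
    """Append records to a log file and return how many fsyncs were issued.

    sync_each=True  -> fsync after every record (durable per write, slow).
    sync_each=False -> write everything, then one explicit fsync at the end
                       (records before that sync may be lost on power failure).
    """
    syncs = 0
    with open(path, "ab") as f:
        for rec in records:
            f.write(rec)
            if sync_each:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to push it to the device
                syncs += 1
        if not sync_each:
            f.flush()
            os.fsync(f.fileno())       # single sync covering the whole batch
            syncs += 1
    return syncs


records = [b"x" * 100 for _ in range(1000)]  # many small 100 B writes
with tempfile.TemporaryDirectory() as d:
    per_write = append_records(os.path.join(d, "a.log"), records, sync_each=True)
    batched = append_records(os.path.join(d, "b.log"), records, sync_each=False)

print(per_write, "fsyncs vs", batched)
```

Both files end up with the same bytes. What differs is how many times the device is forced to persist a page, and which writes are guaranteed to survive a crash at any given moment.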