|
|
|
|
|
by ajhconway
1480 days ago
|
|
With many workloads, there's a tradeoff between the granularity of durability and the overall performance. If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write, this causes a 4KiB page write to sync, which is 40x write amp. Suddenly a 2GiB/sec SSD can effectively only write 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs). So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable. RocksDB offers a fairly weak guarantee by default too, writing to the write-ahead-log (WAL), but not performing fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...). |
|
I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.
The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.