
by ajhconway 1480 days ago

With many workloads, there's a tradeoff between the granularity of durability and the overall performance.

If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write: this causes a 4KiB page write to sync, which is roughly 40x write amplification. Suddenly a 2GiB/sec SSD can effectively only write 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs).
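To make the arithmetic above concrete, here is a minimal sketch. The function names are made up for illustration, and it assumes every synced write flushes whole 4KiB pages and that nothing else limits the device.

```python
PAGE_SIZE = 4 * 1024        # 4 KiB page, as in the example above
WRITE_SIZE = 100            # 100 B application write
RAW_THROUGHPUT = 2 * 1024**3  # 2 GiB/sec device

def write_amplification(write_size: int, page_size: int = PAGE_SIZE) -> float:
    """Bytes physically written per byte the application asked to write,
    assuming each synced write flushes whole pages (an assumption)."""
    pages = -(-write_size // page_size)  # ceiling division
    return pages * page_size / write_size

def effective_throughput(raw: float, write_size: int, page_size: int = PAGE_SIZE) -> float:
    """Useful bytes per second left over once amplification is paid for."""
    return raw / write_amplification(write_size, page_size)

amp = write_amplification(WRITE_SIZE)
eff = effective_throughput(RAW_THROUGHPUT, WRITE_SIZE)
print(f"{amp:.2f}x amplification, {eff / 1024**2:.1f} MiB/sec effective")
```

So the "40x" is really 40.96x, and the 50MiB/sec figure follows from dividing 2GiB/sec by that factor.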

So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.
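As a rough sketch of the two ends of that range (not any particular product's API): the hypothetical `append_records` helper below either fsyncs after every write or issues a single explicit sync at the end. It assumes small appends to a regular file, so it does not handle partial `os.write` results.

```python
import os
import tempfile

def append_records(path, records, sync_each):
    """Append records to path; return how many fsync calls were issued."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    fsyncs = 0
    try:
        for record in records:
            os.write(fd, record)  # reaches the OS page cache, not yet durable
            if sync_each:
                os.fsync(fd)      # durable after every single write
                fsyncs += 1
        if not sync_each:
            os.fsync(fd)          # one explicit sync covers all writes above
            fsyncs += 1
    finally:
        os.close(fd)
    return fsyncs

records = [b"x" * 100 for _ in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    per_write = append_records(os.path.join(tmp, "a.log"), records, sync_each=True)
    batched = append_records(os.path.join(tmp, "b.log"), records, sync_each=False)
print(per_write, batched)
```

With the explicit-sync variant, anything written before the final `os.fsync` can be lost on a power cut, which is exactly the tradeoff being discussed.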

RocksDB offers a fairly weak guarantee by default too: it writes to the write-ahead log (WAL) but does not perform fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).

2 comments

otterley 1480 days ago

You’re absolutely correct about those facts, but you’re also avoiding the thrust of my argument about improperly calling your database durable when it is decidedly not and could fail a trivial power-cut test. A database’s one job is not to lose data.

I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.

The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.

ayende 1480 days ago

Transaction merging allows you to handle that nicely, by taking concurrent writes and merging them into a single write to the disk.
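A minimal sketch of that idea (often called group commit), assuming a single committer thread and an append-only log file. `GroupCommitLog` and `demo` are hypothetical names for illustration, not the API of any real database, and error handling and partial writes are ignored.

```python
import os
import queue
import tempfile
import threading

class GroupCommitLog:
    """Merges concurrent writes into one disk write plus one fsync per batch."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._pending = queue.Queue()
        self.fsync_count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, data: bytes):
        done = threading.Event()
        self._pending.put((data, done))
        done.wait()  # returns only once the batch holding `data` is fsynced

    def _run(self):
        stop = False
        while not stop:
            batch = [self._pending.get()]  # block until there is work
            while True:                    # then grab everything else queued
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            if None in batch:              # None is the shutdown sentinel
                stop = True
                batch = [item for item in batch if item is not None]
            if batch:
                os.write(self._fd, b"".join(data for data, _ in batch))
                os.fsync(self._fd)         # one sync for the whole batch
                self.fsync_count += 1
                for _, done in batch:
                    done.set()

    def close(self):
        self._pending.put(None)
        self._thread.join()
        os.close(self._fd)

def demo(n_threads=8, writes_per_thread=10):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        log = GroupCommitLog(path)

        def worker():
            for _ in range(writes_per_thread):
                log.write(b"x" * 100)

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        log.close()
        return os.path.getsize(path), log.fsync_count
```

Every caller still waits for an fsync before `write` returns, so nothing acknowledged is lost, but the number of fsyncs is at most the number of writes and usually lower under concurrency. How much lower depends on timing, so the sketch makes no claim about a specific count.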