by otterley 1490 days ago
I’m a little confused. If you don’t ensure data is committed to storage (log or otherwise) before acking the write request, how can you call it durable?

If it’s not truly 100% durable by default, it’s best not to suggest that it is. Experience says people will use the default settings and then become very cross if they lose data. It undermines trust and is harmful to reputation.

With many workloads, there's a tradeoff between the granularity of durability and the overall performance.

If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write: syncing it forces a full 4KiB page write, which is roughly 40x write amplification. Suddenly a 2GiB/sec SSD can effectively write only about 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs).
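To make the arithmetic concrete, here is a small sketch that recomputes those numbers. The function names are made up for illustration, and it assumes every synced write is rounded up to whole 4KiB pages, which is a simplification of what a real device does.

```python
PAGE_SIZE = 4 * 1024           # 4 KiB page, as in the comment above
WRITE_SIZE = 100               # 100 B application write
RAW_BANDWIDTH = 2 * 1024**3    # 2 GiB/sec SSD

def write_amplification(write_size, page_size=PAGE_SIZE):
    """Bytes hitting the device per byte of payload, assuming each
    synced write is rounded up to whole pages (a simplification)."""
    pages = -(-write_size // page_size)  # ceiling division
    return pages * page_size / write_size

def effective_bandwidth(raw_bandwidth, write_size, page_size=PAGE_SIZE):
    """Payload bytes per second left after write amplification."""
    return raw_bandwidth / write_amplification(write_size, page_size)

amp = write_amplification(WRITE_SIZE)
eff_mib = effective_bandwidth(RAW_BANDWIDTH, WRITE_SIZE) / 1024**2
print(f"write amplification: {amp:.2f}x")
print(f"effective bandwidth: {eff_mib:.1f} MiB/sec")
```

Under that assumption the amplification comes out just under 41x, which is where the rounded 40x and 50MiB/sec figures come from.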

So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.
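As a rough illustration of what an explicit sync operation looks like from the application side, here is a sketch using plain files and `os.fsync`. It is not any particular database's API; `append_records` and the file names are hypothetical. It only shows the difference between syncing after every write and syncing once at a point the application chooses.

```python
import os
import tempfile

def append_records(fd, records, sync_each):
    """Append records to fd and return how many fsync calls were made.

    sync_each=True  -> every write is followed by its own fsync.
    sync_each=False -> writes are buffered by the OS and a single
                       fsync at the end covers all of them.
    """
    syncs = 0
    for rec in records:
        os.write(fd, rec)
        if sync_each:
            os.fsync(fd)   # ask the OS to flush this write to the device
            syncs += 1
    if not sync_each:
        os.fsync(fd)       # one explicit sync for the whole run
        syncs += 1
    return syncs

records = [b"x" * 100 for _ in range(50)]  # 50 small 100 B writes
with tempfile.TemporaryDirectory() as tmp:
    fd = os.open(os.path.join(tmp, "per_write.log"), os.O_WRONLY | os.O_CREAT, 0o600)
    per_write_syncs = append_records(fd, records, sync_each=True)
    os.close(fd)

    fd = os.open(os.path.join(tmp, "batched.log"), os.O_WRONLY | os.O_CREAT, 0o600)
    batched_syncs = append_records(fd, records, sync_each=False)
    os.close(fd)

print(per_write_syncs, batched_syncs)
```

In the second mode, nothing written before the final `fsync` returns should be treated as durable, which is exactly the window the parent comment is objecting to. Whether `fsync` itself reaches stable media also depends on the platform and the hardware.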

RocksDB offers a fairly weak guarantee by default too, writing to the write-ahead-log (WAL), but not performing fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).

You’re absolutely correct about those facts, but you’re also avoiding the thrust of my argument about improperly calling your database durable when it is decidedly not and could fail a trivial power-cut test. A database’s one job is not to lose data.

I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.

The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.

Transaction merging allows you to handle that nicely, by taking concurrent writes and merging them into a single write to the disk.
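A minimal sketch of that idea (often called group commit), assuming a simple append-only log file. `GroupCommitLog` and `run_demo` are hypothetical names, and this is not how any specific database implements it. Each writer blocks until its record has been fsynced, but one fsync can cover every record that queued up in the meantime.

```python
import os
import tempfile
import threading

class GroupCommitLog:
    """Hypothetical append-only log that merges concurrent writes."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self._queue_lock = threading.Lock()   # protects _pending
        self._flush_lock = threading.Lock()   # one flusher at a time
        self._pending = []                    # (record, done_event) pairs
        self.fsync_count = 0                  # only changed under _flush_lock

    def write(self, record):
        """Return only after the record has been written and fsynced."""
        done = threading.Event()
        with self._queue_lock:
            self._pending.append((record, done))
        with self._flush_lock:
            if done.is_set():
                return  # an earlier flusher already covered this record
            # This thread flushes everything queued so far, itself included.
            with self._queue_lock:
                batch, self._pending = self._pending, []
            view = memoryview(b"".join(rec for rec, _ in batch))
            while len(view) > 0:              # handle partial writes
                view = view[os.write(self._fd, view):]
            os.fsync(self._fd)                # one sync for the whole batch
            self.fsync_count += 1
            for _, event in batch:
                event.set()                   # ack every merged writer

    def close(self):
        os.close(self._fd)

def run_demo(n_writers=8):
    """Start n_writers concurrent writers; return (fsync count, log lines)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        log = GroupCommitLog(path)
        threads = [
            threading.Thread(target=log.write, args=(b"r%03d\n" % i,))
            for i in range(n_writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        log.close()
        with open(path, "rb") as f:
            return log.fsync_count, f.read().splitlines()
```

Calling `run_demo(8)` returns the number of fsyncs actually issued along with the records that landed in the log. The fsync count is somewhere between 1 and 8 depending on how the threads interleave. The more writers pile up behind a slow sync, the more records each sync covers, so durability is still acknowledged per write without paying one sync per write.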