by ajhconway 1490 days ago
Hi, research lead for SplinterDB here.

SplinterDB does make all writes durable and in fact has its own user-level cache which generally performs writes directly to disk (using O_DIRECT for example).

Like RocksDB's default behavior (no fsyncs on the log), it does not immediately sync writes to its log when they happen. It waits to sync in batches, so writes may not be immediately durable, but logging is more efficient. This is a slightly stronger default durability guarantee than RocksDB's, and we intend to make this configurable.


I’m a little confused. If you don’t ensure data is committed to storage (log or otherwise) before acking the write request, how can you call it durable?

If it’s not truly 100% durable by default, it’s best not to suggest that it is. Experience says people will use the default settings and then become very cross if they lose data. It undermines trust and is harmful to reputation.

With many workloads, there's a tradeoff between the granularity of durability and the overall performance.

If a workload has many small writes (some of our product workloads do), then syncing each write can cause write amplification and massively affect overall throughput and latency. Suppose I do a 100B write: this causes a 4KiB page write to sync, which is about 40x write amplification. Suddenly a 2GiB/sec SSD can effectively only write 50MiB/sec. Similarly, the per-write latency goes from <5us to 10us (with the fastest Optane SSDs) or 150us (with flash SSDs).
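
For anyone who wants to check the arithmetic above, here is a minimal sketch. The function names are made up for illustration, and it assumes that every synced write costs exactly one full 4KiB page and nothing else.

```python
# Back-of-the-envelope check of the write amplification numbers.
PAGE_BYTES = 4 * 1024          # 4KiB page written per synced write
WRITE_BYTES = 100              # 100B logical write
DEVICE_MIB_PER_SEC = 2 * 1024  # 2GiB/sec SSD, expressed in MiB/sec


def write_amplification(write_bytes, page_bytes=PAGE_BYTES):
    """Bytes physically written per byte the application asked to write."""
    return page_bytes / write_bytes


def effective_throughput(device_mib_per_sec, amplification):
    """Useful (application-level) MiB/sec once amplification is paid."""
    return device_mib_per_sec / amplification


amp = write_amplification(WRITE_BYTES)
print(f"write amplification: {amp:.2f}x")
print(f"effective throughput: {effective_throughput(DEVICE_MIB_PER_SEC, amp):.0f} MiB/sec")
```

The exact ratio is 4096/100 = 40.96, which is where the "about 40x" and "50MiB/sec" figures come from.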

So storage systems tend to offer a range of durability guarantees. Some systems have a special sync operation for applications to ensure that all writes are durable.

RocksDB offers a fairly weak guarantee by default too, writing to the write-ahead-log (WAL), but not performing fsyncs (https://github.com/facebook/rocksdb/wiki/WAL-Performance). They make a similar write amplification argument too (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).

You’re absolutely correct about those facts, but you’re also avoiding the thrust of my argument about improperly calling your database durable when it is decidedly not and could fail a trivial power-cut test. A database’s one job is not to lose data.

I respectfully call on you to rescind that word in your documentation for cases when it is not activated, including the default configuration. If this is the default to help the database’s reported benchmark performance, falsely implying it’s durable is simply cheating. And if the hardware has limitations that impact performance, c’est la vie. All storage hardware does.

The fact that RocksDB does this makes any claims of durability it makes equally specious. And as we were taught as schoolchildren, two wrongs do not make a right. RocksDB needs to address this too, to the extent it makes or implies any false or misleading durability claims.

Transaction merging lets you handle that nicely, by taking concurrent writes and merging them into a single write to the disk.
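
A rough sketch of that idea, which is usually called group commit: the first writer to arrive becomes the "leader", writes everything queued so far, and issues one fsync that covers all of it. `GroupCommitLog`, `append` and `demo` are hypothetical names, and this is not how SplinterDB or RocksDB implement it. Error handling is minimal: a real log would also need record framing to cope with a torn partial write.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Hypothetical log where one fsync covers every record queued so far."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._cond = threading.Condition()
        self._pending = []      # records not yet handed to a flush
        self._next_seq = 0      # sequence number for the next record
        self._durable_seq = 0   # every record with seq < this has been fsynced
        self._flushing = False  # True while some thread is writing a batch
        self.sync_count = 0     # number of successful fsync calls

    def append(self, record):
        """Block until `record` has been written and fsynced."""
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._pending.append(record)
        while True:
            with self._cond:
                if self._durable_seq > seq:
                    return
                if self._flushing:
                    self._cond.wait()  # a leader is flushing; re-check after
                    continue
                # Become the leader: take everything queued so far.
                batch, self._pending = self._pending, []
                upto = self._next_seq
                self._flushing = True
            ok = False
            try:
                self._write_all(b"".join(batch))
                os.fsync(self._fd)  # one sync for the whole batch
                ok = True
            finally:
                with self._cond:
                    self._flushing = False
                    if ok:
                        self._durable_seq = upto
                        self.sync_count += 1
                    else:
                        # Put the batch back so a later leader can retry it.
                        self._pending = batch + self._pending
                    self._cond.notify_all()

    def _write_all(self, data):
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than asked
            view = view[os.write(self._fd, view):]

    def close(self):
        os.close(self._fd)


def demo(n_writers=8):
    """Run concurrent writers; return (fsync count, sorted records on disk)."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wal.log")
        log = GroupCommitLog(path)
        threads = [threading.Thread(target=log.append, args=(b"rec-%d\n" % i,))
                   for i in range(n_writers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        log.close()
        with open(path, "rb") as f:
            lines = f.read().splitlines()
    return log.sync_count, sorted(lines)


if __name__ == "__main__":
    print(demo())
```

No write is acknowledged before an fsync that covers it has returned, yet the number of fsyncs can be lower than the number of writes when writers overlap. How much lower depends on timing.
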

I missed the use of direct I/O, and the comment about fsync threw me off, thanks. Very impressive then!

O_DIRECT doesn't provide power-cut durability on storage devices with write cache.

Recently written and acknowledged data can still be lost on a power cut.

You still need fsync, fdatasync or equivalent after an O_DIRECT write, to tell the storage device to commit its write cache to the non-volatile layer.
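
A minimal sketch of that pattern in Python, assuming Linux and a 4096-byte alignment requirement (the real requirement depends on the device and filesystem). `durable_direct_write` is a made-up helper name. It falls back to a buffered write where O_DIRECT is unavailable or rejected, and in both cases the fsync is what requests the flush to stable storage.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed alignment for buffer address, length and file offset


def durable_direct_write(path, payload):
    """Write one zero-padded block at offset 0, then fsync.

    Returns True if O_DIRECT was actually used, False on fallback.
    """
    if len(payload) > BLOCK:
        raise ValueError("payload must fit in one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping: page-aligned memory
    buf.write(payload)          # the rest of the block stays zero
    o_direct = getattr(os, "O_DIRECT", 0)  # flag is not defined on every OS
    try:
        for extra in (o_direct, 0):
            try:
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra, 0o644)
            except OSError as e:
                if extra and e.errno == errno.EINVAL:
                    continue  # filesystem rejects O_DIRECT; retry buffered
                raise
            try:
                if os.write(fd, buf) != BLOCK:
                    raise OSError(errno.EIO, "short write")
                # O_DIRECT only bypasses the page cache. The data may still
                # sit in the device's volatile write cache until a flush is
                # requested, which is what fsync is for.
                os.fsync(fd)
                return bool(extra)
            except OSError as e:
                if not (extra and e.errno == errno.EINVAL):
                    raise
            finally:
                os.close(fd)
    finally:
        buf.close()
```

Whether the device really commits its cache when asked is still up to the hardware.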

(And last time I looked, I think some filesystems even incorrectly failed to flush the device write cache on fsync after O_DIRECT writes, because there were no dirty pages.)

There are a ton of devices on the market that will lie to you too, saying caches are flushed when they aren't. If you really want that data to be there, better use “server grade” hardware with power-loss protection.