| One thing about LSM trees that are implemented with large numbers of large files in a filesystem, such as RocksDB, is that they defer to the filesystem to deal with fragmentation and block lookup isues. That's not actually free. LSM tree descriptions typically imply or say outright that each layer is laid out linearly, written sequentially, and read sequentally for merging. And that looking up a block within a layer is an O(1) operation, doing random access I/O to that location. But really, the underlying filesystem is doing a lot of heavy lifting. It's maintaining the illusion of linear allocation by hiding how the large files are fragmented. That sequential writing is mostly sequential, but typically becomes more fragmented in the filesystem layer as the disk gets closer to full, and over time as various uses of the filesystem mean there are fewer large contiguous regions. More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes. Lookup of a block inside a layer requires the filesystem to lookup in its extent tree or, with older filesystems, through indirect block lookups. Those are hidden from the LSM tree database, but are not without overhead. Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks. Even a simple operation like the LSM tree database deleting a layer file it has finished with, is not necessarily simple and quick at the filesystem layer. In other words, when analysing performance, filesystems are the unsung heroes underlying some LSM tree databases. Their algorithmic overhead is often not included in the big-O analysis of LSM tree algorithms running over them, but should be, and their behaviour changes as disk space shrinks and over time due to fragmentation. |
I think that's vastly under selling what's done to ensure that each block is written linearly, blocks are structured, sized, written and accessed in a way that the filesystem does very little (directio, fadvise, droping caches on writes, etc). I was in total agreement with you, for a long time. The rocksdb devs have put in the work, and tuning rocksdb usually gets faster the less the FS does.
Lately linear reads and writes are not why one is choosing LSM's in a datacenter setting. Access times of even cheap slow ssd's are amazing.
They are used for controlling write amplification with tunable known costs. That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.