| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jlokier 1158 days ago

One thing about LSM trees that are implemented with large numbers of large files in a filesystem, such as RocksDB, is that they defer to the filesystem to deal with fragmentation and block lookup isues. That's not actually free.

LSM tree descriptions typically imply or say outright that each layer is laid out linearly, written sequentially, and read sequentally for merging. And that looking up a block within a layer is an O(1) operation, doing random access I/O to that location.

But really, the underlying filesystem is doing a lot of heavy lifting. It's maintaining the illusion of linear allocation by hiding how the large files are fragmented. That sequential writing is mostly sequential, but typically becomes more fragmented in the filesystem layer as the disk gets closer to full, and over time as various uses of the filesystem mean there are fewer large contiguous regions. More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

Lookup of a block inside a layer requires the filesystem to lookup in its extent tree or, with older filesystems, through indirect block lookups. Those are hidden from the LSM tree database, but are not without overhead.

Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks.

Even a simple operation like the LSM tree database deleting a layer file it has finished with, is not necessarily simple and quick at the filesystem layer.

In other words, when analysing performance, filesystems are the unsung heroes underlying some LSM tree databases. Their algorithmic overhead is often not included in the big-O analysis of LSM tree algorithms running over them, but should be, and their behaviour changes as disk space shrinks and over time due to fragmentation.

4 comments

eclark 1158 days ago

>But really, the underlying filesystem is doing a lot of heavy lifting.

I think that's vastly under selling what's done to ensure that each block is written linearly, blocks are structured, sized, written and accessed in a way that the filesystem does very little (directio, fadvise, droping caches on writes, etc). I was in total agreement with you, for a long time. The rocksdb devs have put in the work, and tuning rocksdb usually gets faster the less the FS does.

Lately linear reads and writes are not why one is choosing LSM's in a datacenter setting. Access times of even cheap slow ssd's are amazing.

They are used for controlling write amplification with tunable known costs. That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

jlokier 1158 days ago

I agree that linear reads and writes are not relevant these days in most settings.

I've worked on my own DB engine that uses a structure similar to LSM (but it's not an LSM tree), where the highest possible performance (millions of TPS) for random-writes mixed with semi-sorted writes mixed with random-reads on current SSDs was the target. There's no need for any data to be allocated sequentially on those, other than just enough aggregation to ensure a sufficiently large block size to reduce IOPS during streaming and merging operations, when IOPS-bound. Indeed it's better to fragment to reuse already filesystem-allocated space where possible - that lowers overhead on the filesystem.

I also agree that a well tuned RocksDB can perform very well, and that the authors have done the work, and that it has methods to reduce avoidable write amplification.

However, the RocksDB applications I've seen haven't use the fancy APIs to get the most out of it. They just used it as a plain k-v store with little understanding of what makes a DB sing, and got not great performance as a result.

eclark 1158 days ago

> There's no need for any data to be allocated sequentially

There's two different times that it's critical. The first is for write ahead logs. Those are pre-allocated, opened, and ready in order to stop latency spikes.

The second is to control write amplification on ssd writes. If you write 100K bytes. Then issue a hardware flush. Then write another 100K. That's going to be two different writes to the same ssd. Meaning you'll burn it out double fast. That's a huge no no. Flash burn rate is one of the things that LSM's are good at.

But again things have changed since the first times that LSM's became en vouge. Previously it was critically important to write these files out sequentially because it did mean that large parts of the file were laid out sequentially in HHD tracks. That did work. It was important, and it wasn't the filesystem. It was all about issuing huge linear writes that had to be spilled to disk. There were times that the added linear read speed of reading data sequentially was important (read ahead caching, OLAP style queries, etc)

Now the linear reads and writes are structured around SSD block sizes. So 512k or so. Linear reading speed is bottle necked much more on cpu.

>Indeed it's better to fragment to reuse already filesystem-allocated space where possible - that lowers overhead on the filesystem.

I don't understand what filesystem overhead you're referring to here. A read comes in for a key. The SST files are already pre-opened so there's no FS interaction.

Then you walk the index blocks. Almost always those are in cache from just opening the files. (So I'll hand wave a little here but index blocks are just like data blocks with near 100% hit rates)

From there we have walked the index and know that the key we're looking for could be in a specific block. We have the offset in the file to a block.

That block is located on one single logical hardware unit (hdd sector, ssd block, raid stripe size, etc). So the read that finally goes to the fs layer where we translate one read of a index, offset into a read on a single hardware sector, with page cache off, DirectIO is used to. Read ahead is tuned at the application layer not the FS layer. So that will result in a single read operation being sent to the nvme/ssd.

Essentially the SST file is structured such that the reads use extents as a translation layer only of different address spaces the hardware and the file. That translation layer didn't buy LSM's nearly as much as regular file users. And we had to force everything to align so that each read of a key ends up a single hardware sector read.

Contrast that with using the FS more and relying on hardware structure less. I'll assume that walking the index is always in memory as before. So then we need to read some segment of the file. Issue a read. That read goes to the page cache. The page cache has lots of fun with locking and very strange behavior. We're then left with some un-cached data to read. The larger key/value sizes get the more likely it is that you have extra to read. Eg you only need 100k, but that 100k crosses over the ssd block or raid stripe boundary. Now you are waiting on two operations. Those need to fill the page cache, then return the data to waiting buffers.

pbalcer 1158 days ago

And then there's the SSD controller, which yet again tries to maintain the illusion of efficiently supporting random I/O. For software that tries to get the best possible performance of the physical media, avoiding the OS and controller overhead, it's definitely important to maintain data in sequential blocks. In fact, NVMe ZNS [0], doesn't even allow you to do random writes.

[0] - https://zonedstorage.io/docs/introduction/zns#overview

dietrichepp 1158 days ago

At this point, we are in logs three levels deep—we have a database with a log-structured merge tree, we have a journaling filesystem like ext4, and we have an SSD controller which uses log structures internally for wear leveling.

I recommend the paper, “Don’t stack your Log on my Log”:

https://www.usenix.org/system/files/conference/inflow14/infl...

Basically, the beautiful thing about log structures is that they work well on “dumb” storage. Paradoxically, when you put logs on top of logs, you can make things worse. This is unintuitive, but it reminds me of the fact that if you tunnel a TCP connection over another TCP connection, the result is something which is much less reliable than an ordinary TCP connection operating over an unreliable network.

AtlasBarfed 1157 days ago

Plus cache layers upon cache layers

stingraycharles 1158 days ago

> That is you write fewer hardware blocks to the flash chips with a well tuned rocksdb.

What would you consider a well-tuned rocksdb? My understanding is that, due to level-based compaction, there is always a decent amount of write amplification that is unavoidable -- i.e. for one modification to eventually end up in the bottommost level (e.g. L6), it would need to be (re-)written to disk 5 or 6 times. That's quite heavy an amount of write amplification, but maybe my expectations are off?

eclark 1158 days ago

Tuning RocksDB well is a very very hard challenge, and one that I am happy to not do day to day anymore. RocksDB is very powerful but it comes with other very sharp edges. Compaction is one of those, and all answers are likely workload dependent.

If you are worried about write amplification then leveled compactions are sub-optimal. I would try the universal compaction.

- https://github.com/facebook/rocksdb/wiki/Universal-Compactio...

singron 1158 days ago

All those writes happen in big blocks with other updates, so the number of writes per update is still relatively low. Compare that to e.g. postgres where a single update causes quite a few page writes that probably don't get batched up (add new row version in heap, update old row version in heap, maybe update indexes, vacuum old tuple in heap, maybe vacuum indexes).

jlokier 1157 days ago

A single durable update is the worst case for a classic B-tree with no WAL, though, and where the classic LSM-tree has the advantage. You can't use it to show a consistent write advantage to LSM-trees, except in applications that only do those kinds of updates.

If you write (say) 1000 small, contiguous rows in a single transaction, the B-tree wins. The B-tree writes a few full leaf pages and then updates just enough interior pages to point to them. The LSM-tree writes the rows to the first level, then later will write them again to the second level, then to the third layer, then... N times for N levels. The data also has to be read N-1 times.

At (say) 100,000 contiguous rows in a single transaction, the classic LSM-tree is still poor, but a modern LSM-tree with multiple files in the bulk level, with special handling of bulk loads to bypass the levels and write directly to the logical-middle in the bulk level, is similar I/O performance to the B-tree, so it catches up. I think RocksDB has this but you have to request it explicitly, rather than it detecting when to do so automatically. Probably with lower layout discontiguity if it works by writing to files or large contiguous zones. But if we are comparing LSM-tree with modification for bulk loads, we may as well compare with B-tree with a similar modification, which can also use a zone strategy to reduce discontiguity on bulk loads.

vlovich123 1158 days ago

I agree that DB papers will typically overlook the impact the filesystem has on the database (not just rocksdb - what you wrote is true for everything except something like BlueStore). It’s particularly depressing when you look at how they measure write amplification which tends to ignore things they’re just offloading to the filesystem.

However, I think you’re making a mistake on a core part of your argument:

> More fragmented free space makes the allocation algorithms have to do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.

The file system in no way needs to guarantee on-disk contiguity for read or write performance, nor does any online defrag need to happen. Indeed, the whole premise behind LSM trees is to try to optimize around solid state storage. AFAIK if the filesystem can only find 1 MiB blocks it will allocate them at the cost of a larger set of extents (there’s also defrag happening). Typically the filesystems do a fantastic job of defrag too. That’s certainly an important part but I’d say those parts of the filesystem are likely the first things implemented and never/rarely changed (just a hunch - I haven’t actually bothered looking at the Linux changelog).

Also no one is really going to care about performance on an almost full filesystem (kind full like 75% but old so lots of fragments is valid but I doubt it’s actually a problem because of how good filesystems are).

jlokier 1158 days ago

> However, I think you’re making a mistake on a core part of your argument:

No, that part has been misunderstood so I guess I didn't write it clearly enough.

I'm not saying the filesystem defragments, or does any particular effort to ensure on-disk contiguous storage. I'm saying that as a result of the presence of other data on the filesystem and historic accumulating entropy in layout (sometimes caused by an LSM tree DB!), the filesystem ends up keeping track of appended data discontiguously on disk. In keeping track, the filesystem has to consult its free-space structures to find new free areas, with a heuristic or two to decide whether to allocate a large or small one, write updates to the free-space structures, and keep track of the resulting discontiguous mapping for the file by writing to those structures too. Even when it's contiguous on-disk, versions of those metadata writes are needed for transactional, durable DB writes. They're simpler during bulk non-durable writes of course.

These are the "more work, sometimes more I/O, just to allocate space".

Good filesystems are efficient, but that activity does add overhead (especially when durable transactions are involved) and big-O algorithmic factors (those 1 MiB extents add up), and the picture painted of linear-time operations in LSM papers is inaccurate as a result. I don't think this overhead is necessarily large in practice most of the time. However it's where much of the complexity and corner cases lie, if honestly analysing the performance range and big-O scaling factors.

You make an interesting point about write amplification at the filesystem layer not being accounted for. In addition, classic LSM trees also have significant write amplification (regardless of filesystem or even block device) due to the simple act of writing all data to every layer. This is well known by LSM designers, and there are mitigations, but it's somehow left out of the public image of LSM trees. Classic LSM trees are excellent for random-access writes that need to be gradually sorted, but for some other write patters, are slower (more I/O) than good quality B-trees and some other structures.

> no one is really going to care about performance on an almost full filesystem (kind full like 75% but old so lots of fragments is valid but I doubt it’s actually a problem because of how good filesystems are).

Heh. In practice, every RocksDB or LevelDB I've seen in production is hundreds of GB or several TB on filesystems that have run low on space or even run out, mainly due to the size of the DB :-) They also have thousands or tens of thousands of files in each DB, which they have to open for random-access queries. This is what motivated me to recognise the filesystem can be quite involved.

vlovich123 1158 days ago

I think we’re mostly on the same page. I’m more approaching it from intrinsic complexity vs not. Could a database do a better job than the FS Managing extents manually? Maybe. But I’m not as sold that the win is significant enough vs other techniques that could be taken. So yes the papers should definitely consider the filesystem and they don’t. I’m just not convinced that the fragmentation issue is a filesystem vs db but more of a block allocation thing that would always be there and probably work very similarly with little in the way of optimization. Certainly I’ve seen people too ready to throw away the filesystem without actually confirming the cost benefit.

l05k 1158 days ago

I would recommend a related paper, "Building an Efficient Key-Value Store in a Flexible Address Space", which focused on the interaction problem between extent-based filesystem and key-value store.

However, in scenarios that really care about performance, memory-mapped files seem to be able to bypass the I/O stack and optimize performance. RocksDB also has corresponding optimizations: https://github.com/facebook/rocksdb/wiki/PlainTable-Format

midom 1158 days ago

this misses multiple points - lots of block access happens via internal caches (hyperclock/lru/etc) and filesystem isn't in the critical path as much as you'd think.

files are mapped out in relatively large chunks, especially compaction outputs - there's prealloaction, and usually you will just have a flat file->block conversion without huge trees or anything.

based on performance profiles filesystem doesn't do any heavy lifting, there's not that much fragmentation (and you usually keep some free space for flash GC anyway),

compaction output write is one logical operation on filesystem for tens of megabytes of data.

> filesystems are the unsung heroes underlying some LSM tree databases

meh