Introduction to LSM Trees: May the Logs Be with You

	Introduction to LSM Trees: May the Logs Be with You (priyankvex.wordpress.com)
	128 points by priyankvex 2608 days ago

3 comments

DocSavage 2608 days ago

I found this older introduction to be pretty good and part of a series: https://medium.com/databasss/on-disk-io-part-3-lsm-trees-8b2...

Besides the use of LSM trees in RocksDB and LevelDB-like databases, there is also the WiscKey approach (https://www.usenix.org/node/194425), which reduces read/write amplification by keeping the LSM tree small and moving most values into a separate value log. There's a pure Go implementation of the WiscKey approach, used by dgraph: https://github.com/dgraph-io/badger
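
A minimal sketch of the key/value separation described above, not WiscKey's or Badger's actual code: a plain dict stands in for the small LSM tree and holds only keys and pointers, while the values go to an append-only log. The class name `ValueLogStore` and its methods are hypothetical.

```python
import os
import tempfile


class ValueLogStore:
    """Toy key/value separation: a small index of pointers plus an append-only value log."""

    def __init__(self, path):
        # Stand-in for the small LSM tree: key -> (offset, length) in the value log.
        self.index = {}
        self.log = open(path, "a+b")

    def put(self, key, value):
        # Append the value to the log and remember where it landed.
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)
        self.log.flush()
        self.index[key] = (offset, len(value))

    def get(self, key):
        # One index probe, then one positioned read from the log.
        offset, length = self.index[key]
        self.log.seek(offset)
        return self.log.read(length)

    def close(self):
        self.log.close()


path = os.path.join(tempfile.mkdtemp(), "values.log")
store = ValueLogStore(path)
store.put("a", b"hello")
store.put("b", b"world!")
store.put("a", b"hi")  # the old value for "a" stays in the log as garbage
print(store.get("a"), store.get("b"))
store.close()
```

The index holds one small pointer per key regardless of value size. Overwritten values linger in the log until a garbage collection pass reclaims them, which this sketch omits.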

alexott 2608 days ago

He (Alex Petrov) is also writing a book for O'Reilly on database internals: https://www.goodreads.com/book/show/44647144-database-intern...

dominotw 2608 days ago

Weird that O'Reilly doesn't list that on their own website.

clumsysmurf 2608 days ago

Seems like O'Reilly really goes out of their way to hide ALL books these days. When you go to the main site (https://www.oreilly.com/), where do you see anything related to books? All I see is "online learning", "blended courses", "conferences" and "ideas".

I'm a bit upset by this, because I've found the Safari experience terrible.

indogooner 2608 days ago

I found the build-up to this through the flavors of disk I/O in parts 1 and 2 very useful. Basic stuff, but because I had not done systems programming since my college days, it was a good refresher (and eye-opener).

lichtenberger 2608 days ago

If you just need to fetch values by key from the main storage (the key might be as simple as one produced by a sequence generator), you can avoid the asynchronous background compaction overhead, and with it the unpredictable read or write peaks, by hashing the keys when they are not already integer/long identifiers. Basically, you store a persistent hash-array-based trie (persistent both in the on-disk sense and in the functional, immutable sense). This can easily be extended to store a new revision through copy-on-write semantics. Instead of storing whole page snapshots, however, advances in storage now permit fine-grained access to your data. Thus you can apply lessons learned from backup systems to version the data pages themselves, and even improve on that.

Disclaimer: I'm the author of a recent article about a free open source storage system I'm maintaining, which versions data at its very core: "Why Copy-on-Write Semantics and Node-Level-Versioning are key to Efficient Snapshots": https://hackernoon.com/sirix-io-why-copy-on-write-semantics-...
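
As a rough illustration of the copy-on-write trie idea from the first paragraph (not Sirix's implementation), here is an in-memory sketch. It assumes integer keys, so hashing and collision handling are skipped, and nothing is written to disk. The names `put`, `get`, `BITS`, `FANOUT`, and `DEPTH` are made up for this example.

```python
BITS = 4              # bits of the key consumed per trie level
FANOUT = 1 << BITS    # 16 child slots per node
DEPTH = 8             # 8 levels * 4 bits covers 32-bit keys
MASK = 0xFFFFFFFF


def put(root, key, value):
    """Return a new root containing key -> value; `root` itself is never modified."""
    return _put(root, key & MASK, value, DEPTH - 1)


def _put(node, key, value, level):
    # Copy only the node on the path to the key; every other subtree is shared.
    children = list(node) if node is not None else [None] * FANOUT
    slot = (key >> (level * BITS)) & (FANOUT - 1)
    if level == 0:
        children[slot] = value
    else:
        children[slot] = _put(children[slot], key, value, level - 1)
    return tuple(children)


def get(root, key, default=None):
    """Look up key in the revision identified by `root` (a stored None counts as missing)."""
    node = root
    key &= MASK
    for level in range(DEPTH - 1, -1, -1):
        if node is None:
            return default
        node = node[(key >> (level * BITS)) & (FANOUT - 1)]
    return default if node is None else node


# Each put yields a new revision; older roots stay readable.
rev1 = put(None, 42, "a")
rev2 = put(rev1, 42, "b")
rev3 = put(rev2, 7, "c")
print(get(rev1, 42), get(rev2, 42), get(rev3, 7), get(rev2, 7))
```

Every revision is just a root pointer, and a write copies at most `DEPTH` nodes, so old revisions remain cheap to keep around. An on-disk version would append the copied nodes and a new root instead of building tuples.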

zzzcpan 2608 days ago

You don't actually need to do asynchronous background compaction at all. You can do compaction at any time, in small incremental steps that don't cause any spikes in read or write latencies. Just spreading it across all writes gets you slightly slower but latency-capped writes. It's unfortunate that LevelDB popularized this compaction-in-a-thread idea. It's a pretty bad one.
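
One way to read "spreading it across all writes", as a toy sketch with assumed names (`IncrementalStore`, `memtable_limit`) rather than any real engine's design: a full memtable is frozen into a sorted run, and every later write advances the merge of that run into the base run by a fixed budget of steps. With only one base run, the budget grows with the data size; this sketch does not try to address that.

```python
import bisect
import math


class IncrementalStore:
    def __init__(self, memtable_limit=4):
        self.limit = memtable_limit
        self.memtable = {}   # newest writes
        self.frozen = []     # sorted (key, value) run being merged; newer than base
        self.base = []       # sorted (key, value) run
        self.merged = []     # output of the in-progress merge
        self.i = self.j = 0  # merge cursors into frozen and base
        self.budget = 0      # merge steps performed per write

    def put(self, key, value):
        self.memtable[key] = value
        self._merge_steps(self.budget)  # bounded amount of compaction work
        if len(self.memtable) >= self.limit:
            # Fallback only: the budget below should already have finished the merge.
            self._merge_steps(len(self.frozen) + len(self.base))
            self._start_merge()

    def _start_merge(self):
        self.frozen = sorted(self.memtable.items())
        self.memtable = {}
        total = len(self.frozen) + len(self.base)
        # Finish within the next `limit` writes, before the memtable fills again.
        self.budget = math.ceil(total / self.limit)

    def _merge_steps(self, n):
        """Emit at most n entries of the frozen+base merge, then return."""
        f, b = self.frozen, self.base
        if not f:
            return  # no merge in progress
        while n > 0 and (self.i < len(f) or self.j < len(b)):
            take_f = self.j >= len(b) or (
                self.i < len(f) and f[self.i][0] <= b[self.j][0]
            )
            if take_f:
                if self.j < len(b) and b[self.j][0] == f[self.i][0]:
                    self.j += 1  # frozen entry is newer; drop the base one
                self.merged.append(f[self.i])
                self.i += 1
            else:
                self.merged.append(b[self.j])
                self.j += 1
            n -= 1
        if self.i == len(f) and self.j == len(b):
            # Merge complete: swap in the new base run.
            self.base, self.frozen, self.merged = self.merged, [], []
            self.i = self.j = 0

    def get(self, key, default=None):
        if key in self.memtable:
            return self.memtable[key]
        # frozen and base stay intact while the merge runs, so reads never wait.
        for run in (self.frozen, self.base):
            pos = bisect.bisect_left(run, (key,))
            if pos < len(run) and run[pos][0] == key:
                return run[pos][1]
        return default
```

No background thread is involved: each `put` does at most `budget` merge steps, and reads consult the memtable, then the frozen run, then the base run.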

lichtenberger 2608 days ago

Good catch :-) Right, but the merging/compaction work still has to be done. Maybe too much of it, if you just need to fetch a value by its key and thus only an equality lookup is needed (no range scans or other comparisons). For the latter case I've implemented an AVL tree, which is also versioned and stored in our record pages, and best read fully in-memory (but doesn't have to be). For sure there are plenty of optimizations possible, and for instance also spatio-temporal indexes or full-text indexes, but I guess I'll first look into cost-based rewrite rules for the query compiler and replication/partitioning for horizontal scaling. Too many ideas I guess ;-) but the best would be to have a great open source community :-)

dominotw 2608 days ago

Would you know why LevelDB chose compaction in a thread over the method you are describing?

eeZah7Ux 2608 days ago

Any thoughts on https://en.wikipedia.org/wiki/NILFS ?

lichtenberger 2608 days ago

Haven't heard of it, but the storage system is heavily inspired by ZFS, bringing some of its ideas (plus our own, obviously ;)) to the sub-file level: https://kops.uni-konstanz.de/bitstream/handle/123456789/2769...

eeZah7Ux 2608 days ago

I'd like to see a comparison with using mmap and letting the kernel do the paging.

lichtenberger 2608 days ago

Basically, I'd like to provide the I/O layer with memory-mapped file regions, such that it's simply a configuration option whether you use the RandomAccessFile implementation or one based on memory-mapped files. I think for the JVM there's Chronicle. Thanks so much for asking :-) and by the way, any contribution would be the best I can hope for.
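
A small sketch of what such a configuration switch could look like, written in Python rather than on the JVM and not taken from Sirix: two readers share one `read_at` method, one using explicit seek and read, the other a memory map that leaves paging to the kernel. The names `FileReader`, `MmapReader`, and `open_reader` are hypothetical.

```python
import mmap
import os
import tempfile


class FileReader:
    """Explicit seek + read, similar in spirit to a RandomAccessFile-style backend."""

    def __init__(self, path):
        self.file = open(path, "rb")

    def read_at(self, offset, length):
        self.file.seek(offset)
        return self.file.read(length)

    def close(self):
        self.file.close()


class MmapReader:
    """Maps the whole file read-only; the kernel pages data in on access."""

    def __init__(self, path):
        self.file = open(path, "rb")
        # Length 0 maps the entire file (the file must not be empty).
        self.map = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)

    def read_at(self, offset, length):
        return self.map[offset:offset + length]

    def close(self):
        self.map.close()
        self.file.close()


READERS = {"file": FileReader, "mmap": MmapReader}


def open_reader(path, backend="file"):
    # The backend name is the only thing a caller has to configure.
    return READERS[backend](path)


path = os.path.join(tempfile.mkdtemp(), "pages.bin")
with open(path, "wb") as out:
    out.write(b"0123456789")

for backend in ("file", "mmap"):
    reader = open_reader(path, backend)
    print(backend, reader.read_at(3, 4))
    reader.close()
```

Because both classes expose the same `read_at`, a benchmark comparing them only needs to change the `backend` argument.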

webshit155 2607 days ago

I don't understand the hash index part. I guess that for every segment on disk, you also have a hash table for it, correct? Also, this part doesn't seem to be very memory-efficient.
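
Assuming the scheme works the way the question guesses (one in-memory hash table per on-disk segment, mapping each key to the byte offset of its latest record), a sketch could look like this. The record layout and the names `Segment` and `lookup` are assumptions for illustration, not taken from the article.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">II")  # key length, value length (assumed record layout)


class Segment:
    def __init__(self, path):
        self.file = open(path, "a+b")
        self.index = {}  # key -> offset of its latest record in this segment

    def append(self, key, value):
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(HEADER.pack(len(key), len(value)) + key + value)
        self.file.flush()
        self.index[key] = offset

    def read(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None  # a miss costs one dict probe and no disk I/O
        self.file.seek(offset)
        klen, vlen = HEADER.unpack(self.file.read(HEADER.size))
        self.file.seek(klen, os.SEEK_CUR)  # skip over the key bytes
        return self.file.read(vlen)


def lookup(segments, key):
    """Check segments from newest to oldest and return the first hit."""
    for segment in reversed(segments):
        value = segment.read(key)
        if value is not None:
            return value
    return None


tmp = tempfile.mkdtemp()
old = Segment(os.path.join(tmp, "segment-0"))
new = Segment(os.path.join(tmp, "segment-1"))
old.append(b"k1", b"v1")
old.append(b"k2", b"v2")
new.append(b"k1", b"v1-updated")
print(lookup([old, new], b"k1"), lookup([old, new], b"k2"))
```

Under that assumption the memory concern is fair: every key of every segment lives in RAM, so the approach only fits workloads where the key set is small relative to the values.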
