

by rakoo 3899 days ago

An LSM tree is actually a good idea if you think about it.

The R/W patterns for a message queue are simple:

- Messages are key/value

- key is an autoincrementing id

- Writes are at the end, Reads are from the beginning

- Once a message is processed, it's deleted

So in practice this means that the items are written in an append-only fashion, get merged into bigger chunks, and then get progressively deleted. So at higher levels you don't see the huge latencies due to compaction, because all the records have already been deleted by then. Knowing that keys are only incrementing could also lead to a simple optimization: the compaction phase can be a simple concatenation of files.
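To make the concatenation idea concrete, here is a minimal sketch. It is not taken from any real LSM implementation: `merge_runs` and `concat_runs` are hypothetical names, and a "run" is assumed to be an in-memory list of `(key, value)` pairs sorted by key. When keys only ever increase, runs cannot overlap, so concatenating them in order should give the same result as a full k-way merge.

```python
import heapq

def merge_runs(runs):
    """General case: k-way merge of sorted runs, ordered by key."""
    return list(heapq.merge(*runs, key=lambda kv: kv[0]))

def concat_runs(runs):
    """Special case for strictly increasing keys: order runs by their
    first key and concatenate, with no per-record comparisons."""
    ordered = sorted((r for r in runs if r), key=lambda r: r[0][0])
    # Check the assumption: every run must end before the next one begins.
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev[-1][0] >= nxt[0][0]:
            raise ValueError("runs overlap; a real merge is needed")
    out = []
    for run in ordered:
        out.extend(run)
    return out

# Hypothetical runs, as flushed by an append-only writer.
runs = [
    [(4, "d"), (5, "e")],
    [(1, "a"), (2, "b"), (3, "c")],
    [(6, "f")],
]
assert concat_runs(runs) == merge_runs(runs)
```

The overlap check is what keeps the shortcut honest: if any producer ever writes an out-of-order key, the function refuses rather than silently producing an unsorted file.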

So you get an append-only system that progressively removes older entries as they are deleted, without resorting to mad science hackery [1]. Why didn't it work for InfluxDB? All I can guess is that individual entries for each series are all mixed together (InfluxDB wants to be able to manage many series with many tags) and older entries are not deleted as frantically, so you get the latencies we all know with compaction and unpredictable reads.

Now, this is purely theoretical and of course further experiments are needed to make sure this is correct, but LSM is in my opinion a correct pattern here.

[1] https://gist.github.com/CAFxX/571a1558db9a7b393579

hyc_symas 3898 days ago

A queue is the correct pattern for a queue. A tree, of any form, offers no advantage.

The InfluxDB experience is definitely illuminating. Their problems with LMDB were mainly due to misuse of the API. https://disqus.com/home/discussion/influxdb/benchmarking_lev...

For batched sequential writes, there is no other DB anywhere near as fast as LMDB: http://symas.com/mdb/microbench/ (Section E, Batched Writes)

But even so - the reason LMDB can do this so quickly is that for batched sequential writes it cheats - it's just performing Appends, there's no complicated tree construction/balancing/splitting of any kind going on.

If you know that your workload will only be producer/consumer, with sequentially generated data that is sequentially consumed, it's a stupid waste of time to mess with any other structure than a pure linear queue. (Or a circular queue, when you know the upper bounds of how much data is outstanding.)
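As a rough illustration of the circular-queue variant, here is a minimal in-memory sketch. `RingQueue` is a hypothetical name, the capacity stands in for the known upper bound on outstanding data, and persistence (the hard part in a real message queue) is deliberately left out.

```python
class RingQueue:
    """Fixed-capacity FIFO: producers append at the tail,
    consumers remove from the head, and the storage is reused."""

    def __init__(self, capacity):
        self._buf = [None] * capacity
        self._head = 0  # index of the oldest unconsumed message
        self._size = 0  # number of messages currently outstanding

    def push(self, msg):
        if self._size == len(self._buf):
            raise OverflowError("queue full: upper bound exceeded")
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = msg
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("queue empty")
        msg = self._buf[self._head]
        self._buf[self._head] = None  # consumed messages are simply dropped
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        return msg

    def __len__(self):
        return self._size


q = RingQueue(3)
for m in ("a", "b", "c"):
    q.push(m)
first = q.pop()   # oldest message comes out first
q.push("d")       # reuses the slot that "a" occupied
```

Both operations are O(1) and nothing is ever merged or rebalanced, which is the point being made about sequential producer/consumer workloads.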

As for your initial statement - no, an LSM tree is not a correct pattern here. If your consumers are actually running as fast as (or faster than) your producer then it should never flush from Level0/memory to Level1/disk. In that case all you've got is an in-memory queue that evaporates on a system crash.

If your consumers are running slower, that means data is accumulating in the DB, which means you will have compaction delays. And the compaction delays will only get longer over time, as more and more levels need to be merged. (Remember that merge operations are O(N). Then remember that there are N of them to do. O(N^2) is a horrible algorithmic complexity.) LSM is never a correct pattern.

leif 3894 days ago

> In that case all you've got is an in-memory queue that evaporates on a system crash.

https://www.cs.berkeley.edu/~brewer/cs262/Aries.pdf

> Remember that merge operations are O(N). Then remember that there are N of them to do. O(N^2) is a horrible algorithmic complexity.

No. Mountains of actual math refute this. LSM-tree merges are O(N log N). This is an Actual Fact.

Read more, kids.

hyc_symas 3894 days ago

Ah yes, you're absolutely right. O(N log N) because there are log N chunks to be merged.

O(N log N) is still untenable in the long run; nobody has exponentially growing compute resources.

databass 3894 days ago

May I also mention that N log N is the total cost of compaction for a DB of size N. You don't perform a compaction on every single write. Amortised per write the cost is more like N log(N)/N == log(N).

Also, N log(N) is nowhere near exponential. O(2^N) would be exponential, and that's not what you have here.
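A quick numeric sketch of the amortisation argument, under the simplifying assumptions that total compaction cost is exactly N log2(N) with no constant factors and that N counts writes. The function names are made up for illustration.

```python
import math

def total_compaction_cost(n):
    """Total compaction work for a DB of n writes, taken as n * log2(n)."""
    return n * math.log2(n)

def amortised_cost_per_write(n):
    """Total cost spread over the n writes that caused it: log2(n)."""
    return total_compaction_cost(n) / n

for n in (2**10, 2**20, 2**30):
    # The per-write cost grows by a fixed step each time n is multiplied
    # by 1024, which is logarithmic growth, not exponential.
    print(n, amortised_cost_per_write(n))
```

For comparison, 2^N already exceeds 10^300 at N = 1024, while N log2(N) at that size is about ten thousand.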

hyc_symas 3894 days ago

"Amortised per write" - now you're getting down into the constant factors, which Big-O disregards. But you can't ignore them in real implementations. First, the actual writes have a 2x constant factor, since you're writing to a WAL in addition to the DB itself.

The original LSM paper claims that writes to Level 0 are free because that's all in-memory. But that's not really true; if you have a stream of incoming writes then everything that goes into Level 0 must eventually be pushed out to Level 1. Buffering doesn't make writes free, it only displaces their occurrence in time.

So you have a rolling merge every M writes. As far as Big-O goes, that's N log(N) / M == N log(N) because Big-O disregards constant factors!

In the context of an implementation like LevelDB, theory and reality diverge even further. Since it's chunking data into 2MB files and deleting them during each merge operation, and also writing a bunch of bookkeeping into Manifest files and other stuff, the number of actual I/Os is much higher. A lot of wasted activity in allocating and deallocating space - filesystem metadata overhead that's also not transactionally safe.

In LevelDB a single merge reads 26MB and writes 26MB at a time to push 2MB of data from a level L to level L+1. So now instead of a single merge op costing only N, it actually costs 13*N. Again, if you're only talking about Big-O complexity you sweep this under the rug. But in reality, this is a huge cost.
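The 26MB and 13x figures can be reproduced with a short calculation. This is only a sketch: the worst-case count of 12 overlapping files in level L+1 (10 for the 10x size ratio plus 2 at the boundaries) is an assumption taken from the LevelDB implementation notes, not something stated in this thread.

```python
FILE_MB = 2                  # size of the file pushed from level L
OVERLAPPING_FILES_WORST = 12 # assumed: 10 for the size ratio + 2 at the boundaries

def merge_io_mb():
    """MB read (and, equally, written) by one worst-case merge."""
    return FILE_MB * (1 + OVERLAPPING_FILES_WORST)

def amplification():
    """I/O per merge relative to the data actually moved down a level."""
    return merge_io_mb() // FILE_MB

print(merge_io_mb(), amplification())
```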

hyc_symas 3894 days ago

Stated another way - assume you want to sustain a user workload writing 20MB/sec, and you don't do any throttling. Level 0 consists of four 1MB files - it will fill in 1/5th of a second, and then compaction will reduce it by 1MB. After that it will be compacting continuously every 1/20th of a second. Sustaining this workload for the 1st second will thus require 17 compactions to Level 1. Assuming an already populated Level 1 and worst-case key distribution, that means in 1 second it will trigger compactions that read 238MB and write 238MB (17 compactions at 14MB each) to store the incoming 20MB.

Level 1 is only 10MB, so if it was empty it would fill in the first 1/2 second. For the remaining 1/2 second it would trigger 5 more compactions to Level 2, reading 130MB and writing 130MB. If it started out full then this would be 260MB/260MB respectively.

So for a 20MB/sec input workload you would need a disk subsystem capable of sustaining 498MB/sec of reads concurrent with 498MB/sec of writes. And that's only for a small DB, only Level 0-2 present (smaller than 110MB), and excluding the actual cost of filesystem operations (create/delete/etc).

That's only for the 1st second of load. For every second after that, you're dumping from Level 0 to Level 1 at 280MB read and 280MB write/sec. And dumping from Level 1 to Level 2 at 260/260 as before. 540/540 - so a disk capable of 1080MB/sec I/O is needed to sustain a 20MB/sec workload. And this is supposed to be HDD-optimized? Write-optimized? O(N log N) - what a laugh.
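The arithmetic above can be checked with a small back-of-the-envelope script. It is a sketch under the same assumptions as the comment: worst-case key distribution, no throttling, 14MB of I/O per Level 0 to Level 1 compaction (the value implied by 238MB over 17 compactions), and 26MB per Level 1 to Level 2 compaction. Function and constant names are hypothetical.

```python
WRITE_RATE_MB = 20    # MB of user writes arriving per second
L0_CAPACITY_MB = 4    # four 1MB files
L1_CAPACITY_MB = 10
L1_FILE_MB = 2        # data pushed down by one Level 1 -> 2 compaction
L0_TO_L1_IO_MB = 14   # read (and write) per Level 0 -> 1 compaction
L1_TO_L2_IO_MB = 26   # read (and write) per Level 1 -> 2 compaction

def first_second_io(l1_starts_full=True):
    """Return (L0->L1 MB, L1->L2 MB) read during the first second.
    Writes are the same size as reads."""
    # One compaction when Level 0 first fills, then one per further 1MB.
    l0_compactions = WRITE_RATE_MB - L0_CAPACITY_MB + 1
    # Like the comment, treat the full 20MB as reaching Level 1.
    inflow = WRITE_RATE_MB if l1_starts_full else WRITE_RATE_MB - L1_CAPACITY_MB
    l1_compactions = inflow // L1_FILE_MB
    return l0_compactions * L0_TO_L1_IO_MB, l1_compactions * L1_TO_L2_IO_MB

def steady_state_io():
    """Return (L0->L1 MB, L1->L2 MB) read per second once both levels are full."""
    l0_compactions = WRITE_RATE_MB // 1       # one per incoming 1MB file
    l1_compactions = WRITE_RATE_MB // L1_FILE_MB
    return l0_compactions * L0_TO_L1_IO_MB, l1_compactions * L1_TO_L2_IO_MB

print(first_second_io(), sum(first_second_io()))
print(steady_state_io(), 2 * sum(steady_state_io()))  # reads plus writes
```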

Maybe LSMs in general can be more efficient than this. LevelDB is pretty horrible though.

hyc_symas 3894 days ago

For reference - http://leveldb.googlecode.com/svn/trunk/doc/impl.html
