by eclark
1154 days ago
>But really, the underlying filesystem is doing a lot of heavy lifting.
I think that's vastly underselling what's done to ensure that each block is written linearly, and that blocks are structured, sized, written and accessed in a way that leaves the filesystem very little to do (direct I/O, fadvise, dropping caches on writes, etc.). I was in total agreement with you for a long time. The RocksDB devs have put in the work, and RocksDB usually gets faster the less the FS does. Lately, linear reads and writes are not why one chooses LSMs in a datacenter setting. Access times of even cheap, slow SSDs are amazing. LSMs are used to control write amplification with tunable, known costs. That is, you write fewer hardware blocks to the flash chips with a well-tuned RocksDB.
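To put rough numbers on "tunable, known costs", here is a back-of-envelope sketch of the usual textbook write-amplification model for an LSM. It is not RocksDB's actual accounting: the function names, the `num_levels` and `fanout` parameters, and the worst-case assumption that leveled compaction rewrites each byte about `fanout` times per level are all illustrative assumptions.

```python
def leveled_write_amp(num_levels: int, fanout: float, wal: bool = True) -> float:
    """Rough worst-case estimate for leveled compaction (assumed model).

    Each user byte is written once to the WAL (optional), once when the
    memtable is flushed, and about `fanout` times at each of the
    `num_levels` levels it is compacted into.
    """
    wal_cost = 1.0 if wal else 0.0
    flush_cost = 1.0
    compaction_cost = num_levels * fanout
    return wal_cost + flush_cost + compaction_cost


def tiered_write_amp(num_levels: int, wal: bool = True) -> float:
    """Rough estimate for tiered (size-tiered) compaction (assumed model).

    Each user byte is rewritten about once per level instead of
    `fanout` times, trading read cost for fewer device writes.
    """
    wal_cost = 1.0 if wal else 0.0
    flush_cost = 1.0
    return wal_cost + flush_cost + num_levels


def device_bytes_written(user_bytes: int, write_amp: float) -> float:
    """Bytes that reach the flash device for `user_bytes` of user writes."""
    return user_bytes * write_amp


if __name__ == "__main__":
    for levels, fanout in [(4, 10), (3, 10), (4, 4)]:
        print(levels, fanout,
              leveled_write_amp(levels, fanout),
              tiered_write_amp(levels))
```

The point is only that level count, fanout and compaction style are knobs with predictable effects on how many bytes hit the flash; real figures depend on the workload and on the engine's actual compaction behavior.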
I've worked on my own DB engine that uses a structure similar to an LSM (but it's not an LSM tree), where the target was the highest possible performance (millions of TPS) on current SSDs for random writes mixed with semi-sorted writes and random reads. There's no need for any data to be allocated sequentially on those, other than just enough aggregation to ensure a block size large enough to reduce IOPS during streaming and merging operations, when IOPS-bound. Indeed, it's better to fragment in order to reuse space the filesystem has already allocated where possible; that lowers overhead on the filesystem.
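As a minimal sketch of the "just enough aggregation" idea, the snippet below buffers small records and hands them to the storage layer only once a block-size threshold is reached, so the number of I/O operations drops while the bytes written stay the same. `BlockAggregator`, `block_size` and the in-memory `sink` are hypothetical names for illustration, not part of any engine mentioned here.

```python
import io


class BlockAggregator:
    """Buffer small records and write them to `sink` in large chunks."""

    def __init__(self, sink, block_size: int = 64 * 1024):
        self.sink = sink              # any object with a write(bytes) method
        self.block_size = block_size  # flush threshold in bytes
        self.buf = bytearray()
        self.io_ops = 0               # number of write calls issued to the sink

    def append(self, record: bytes) -> None:
        """Add a record; flush once the buffer reaches the threshold."""
        self.buf += record
        if len(self.buf) >= self.block_size:
            self.flush()

    def flush(self) -> None:
        """Write any buffered bytes as a single I/O operation."""
        if self.buf:
            self.sink.write(bytes(self.buf))
            self.io_ops += 1
            self.buf.clear()


if __name__ == "__main__":
    sink = io.BytesIO()
    agg = BlockAggregator(sink, block_size=64 * 1024)
    for _ in range(10_000):
        agg.append(b"x" * 100)        # 10,000 small 100-byte records
    agg.flush()                       # write out the final partial block
    print(agg.io_ops, len(sink.getvalue()))
```

Nothing here requires the blocks to land sequentially on the device; the aggregation only bounds the number of operations, which is what matters when the workload is IOPS-bound.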
I also agree that a well-tuned RocksDB can perform very well, that the authors have done the work, and that it has methods to reduce avoidable write amplification.
However, the RocksDB applications I've seen haven't used the fancy APIs to get the most out of it. They just used it as a plain k-v store with little understanding of what makes a DB sing, and got mediocre performance as a result.