|
|
|
|
|
by willvarfar
1149 days ago
|
|
A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :) A long time ago we had a big MySQL tokudb db and were keen to migrate to myrocks. But myrocks put every table into a single big file, rather than a file per partition. The partition-per-file is a big deal if you are retaining N days of data in a DB and every night will be dropping some old day. If your DB stores each partition in separate files, the DB can simply delete them. But if your DB stores all the partitions in a single file, then it will end up having to compact your absolutely massive big dataset. It was completely unworkable for us. Has this changed? |
|
Partitioning data across files (or LSM trees) can be a remarkable win. For data retention policies, as well as for exploiting immutability in different workloads to reduce write amplification.
For example, in TigerBeetle, a DB that provides double-entry financial accounting primitives, our secondary indexes mutate, but half of our ingest volume, all the transactions themselves are immutable, and inserted in chronological order.
We therefore designed our local storage engine as an LSM-forest, putting different key/value types in their own tree, so that mutable data wouldn't compact immutable data. This turns our object tree for primary keys into essentially an append-only log.
I did a lightning talk on this, and a few of our other LSM optimizations, at Jamie Brandon's HYTRADBOI conference last year: https://www.youtube.com/watch?v=yBBpUMR8dHw
RocksDB also allows you to do this, with its concept of column families, if I am not mistaken. However, we wanted more memory efficiency with static memory allocation, deterministic execution and deterministic on disk storage for faster testing (think FoundationDB's simulator but with storage fault injection) and faster recovery (thanks to smaller diffs, with less randomness in the data files being recovered), and also an engine that could solve our storage fault model.
All details in the talk. Or ping me if you have questions.