| I think their claims of write amplification reduction are a bit overstated given more realistic workloads. It is true that b-trees aren't ideal in that respect, and you will see some amount of write amplification, but not enough that it should be a major consideration, in my experience You really have to take into account workingset size and cache size to make any judgements there; your b-tree writes should be given by journal/WAL reclaim, which will buffer up updates. A purely random update workload will kill a conventional b-tree on write amplification - like I mentioned, that's the absolute worst case scenario for a b-tree. But it just doesn't happen in the real world. For the data I can give you, that would be bcachefs's hybrid b-tree - large btree nodes (256k, typically) which are internally log structured; I would consider it a minor variation on a classical b-tree. The log structuring mean that we can incrementally write only the dirty keys in a node, at the cost of some compaction overhead (drastically less than a conventional LSM). In actual real world usage, when I've looked at the numbers (not recently, so this may have changed) we're always able to do giant highly efficient b-tree writes - the journal and in-memory cache are batching things up as much as we want - which means write amplification is negligible. |
If you're worried about the last layer being a giant unmanageably large B+-Tree, just shard it similarly in key space to not need much free temporary working space on SSD to stream the freshly compacted data to while the inputs to the compaction still serve real time queries.