Besides the use of LSM trees in RocksDB and leveldb-like databases, there is also the WiscKey approach (https://www.usenix.org/node/194425) that helps read/write amplification by keeping the LSM tree small and mostly removing the values to a separate value log. There's a pure Go implementation of the WiscKey approach used by dgraph: https://github.com/dgraph-io/badger
Seems like OReilly really goes out of their way to hide ALL books these days. When you go to the main site (https://www.oreilly.com/), where do you see anything related to books? All I see is "online learning", "blended courses", "conferences" and "ideas".
I'm a bit upset by this, because I've found the Safari experience terrible.
I found the build up to this through flavors of disk IO in parts 1 and 2 to be very useful to me. Basic stuff but because I had not done system programming since my college days it was a good refresher (and eye-opener).
If you just need to fetch values by a key, for the main storage (might be even as simple as generated by a sequence generator) you can even avoid the asynchronous background compaction overhead and thus unpredictable read- or write-peaks and so on by hashing the keys if it's not already an integer/long based identifier: Basically storing a persistent (both on-disk persistence as well as in the functional sense immutable) hash array based trie. This can easily be extended to store a new revision through copy-on-write semantics. Instead of storing whole page snapshots however, storage advanced now permit fine granular access to your data. Thus you can basically apply lessons learned from backup systems to version the data pages itself and even improve on that.
Disclaimer: I'm the author of a recent article about a free open source storage system I'm maintaining, which versions data at it's very core: "Why Copy-on-Write Semantics and Node-Level-Versioning are key to Efficient Snapshots": https://hackernoon.com/sirix-io-why-copy-on-write-semantics-...
You don't actually need to do asynchronous background compaction at all. You can do compaction whenever in small incremental steps not causing any spikes in read or write latencies. Just spreading it across all writes gets you slightly slower, but latency capped writes. It's unfortunate that LevelDB popularized this compaction in a thread idea. It's pretty bad one.
Good catch :-) right, but still merging/compaction work has to be done. Maybe too much, if you just need to fetch a value by its key and thus just an equality scan is needed (no range scans or other comparisons). For the latter case I've implemented an AVL-tree, which is also versioned and stored in our record pages and best read fully in-memory (but doesn't have to). For sure there are plenty of optimizations and for instance also spatio-temporal indexes or full-text indexes possible, but I guess first looking into cost-based rewrite rules for the query compiler and replication/partitioning for horizontal scaling. Too many ideas I guess ;-) but the best would be to have a great open source community :-)
Basically, I'd like to provide the I/O layer with memory mapped file regions, such that it's simply a configuration option if you use the RandomAccessFile implementation or maybe one based on memory mapped files. I think for the JVM there's Chronicle. Thanks so much for asking :-) and by the way any contribution would be the best I can hope for
I don't understand the hash index part. I guess that for every segment on disk, you also have a hash table for it, correct? Also, this part doesn't seem to be very memory-efficient
Besides the use of LSM trees in RocksDB and leveldb-like databases, there is also the WiscKey approach (https://www.usenix.org/node/194425) that helps read/write amplification by keeping the LSM tree small and mostly removing the values to a separate value log. There's a pure Go implementation of the WiscKey approach used by dgraph: https://github.com/dgraph-io/badger