Hacker News new | ask | show | jobs
by derefr 2117 days ago
Does anyone know of an embedded key-value store that does do versioning/snapshots, but doesn’t bother with cryptographic integrity (and so gets better OLAP performance than a Merkle-tree-based implementation)?

My use-case is a system that serves as an OLAP data warehouse of representations of how another system’s state looked at various points in history. You’d open a handle against the store, passing in a snapshot version; and then do OLAP queries against that snapshot.

Things that make this a hard problem: The dataset is too large to just store the versions as independent copies; so it really needs some level of data-sharing between the snapshots. But it also needs to be fast for reads, especially whole-bucket reads—it’s an OLAP data warehouse. Merkle-tree-based designs really suck for doing indexed table scans.

But, things that can be traded off: there’d only need to be one (trusted) writer, who would just be batch-inserting new snapshots generated by reducing over a CQRS/ES event stream. It’d be that (out-of-band) event stream that’d be the canonical, integrity-verified, etc. representation for all this data. These CQRS state-aggregate snapshots would just be a cache. If the whole thing got corrupted, I could just throw it all away and regenerate it from the CQRS/ES event stream; or, hopefully, “rewind” the database back to the last-known-good commit (i.e. purge all snapshots above that one) and then regenerate only the rest from the event stream.

I’m not personally aware of anything that targets exactly this use case. I’m working on something for it myself right now.

Two avenues I’m looking into:

• something that acts like a hybrid between LMDB and btrfs (i.e. a B-tree with copy-on-write ref-counted pages shared between snapshots, where those snapshots appear as B-tree nodes themselves)

• “keyframe” snapshots as regular independent B-trees, maybe relying on L2ARC-like block-level dedup between them; “interstitial” snapshots as on-disk HAMT ‘overlays’ of the last keyframe B-tree, that share nodes with other on-disk HAMTs, but only within their “generation” (i.e. up to the next keyframe), such that they can all be rewritten/compacted/finalized once the next keyframe arrives, or maybe even converted into “B-frames” that have forward-references to data embedded in the next keyframe.

5 comments

Worked on a project with similar goals [0]. By no means a production level implementation, but much of the system exists as a proof-of-concept.

[0]: https://makedist.com/projects/cruzdb/

You need something like HBase, but embedded. MVCC would give you the snapshot isolation (perhaps there's something with less guarantees?) and you'd need key lexicographic ordering to do efficient scanning. You'd only need the memory layout (e.g. LSM) if you'd keep a write-ahead-log from which to recover.

LevelDB / RocksDB (and related) may be close, but not sure about MVCC aspects (see https://www.cockroachlabs.com/blog/cockroachdb-on-rocksd/)

You misinterpreted, I think. The point isn’t “snapshot isolation” in the MVCC sense (working with multiple snapshots-in-progress); it’s the ability to, in est, work with the database the way git works with commits: opening a transaction “on top of” a base commit, then “committing” that transaction to create a new commit object, with its own explicit ref, where you can later “check out” an arbitrary ref.

Except, unlike git, this database wouldn’t need to be able to create new commits off of anywhere but the HEAD; and also wouldn’t need to be able to have more than one in-progress write transaction open at a time. No need for MVCC at all; and no need for a DAG. The “refs” would always just be a dense linear sequence.

Also, unlike git (or a cryptographically-verified / append-only store), there’s no need to keep around “deleted” snapshots. It would actually be a huge benefit to be able to purge arbitrary snapshots from the database, without needing to do a mark-and-sweep liveness pass to write out a new copy of the database store.

The key constraint that differentiates this from e.g. a Kafka Streams-based CQRS/ES aggregate, is that you should be able to reopen and work with any historical database version instantly, with equal-amortized-cost lookups from any version, without needing to first do O(N) work to “replay” from the beginning of time to the snapshot, or to “rewind” from the HEAD to the snapshot. This system would need all snapshots to be equally “hot” / “online” for query access, not just the newest one.

In other words, such a database should work just like filesystem snapshots do in a copy-on-write filesystem.

It seems like coupling a database checkpoint process with file system’s snapshot process should be theoretically possible: 1) Database informed snapshot needed 2) Database finalizes any in progress writes and starts logging new writes to another file 3) Take file system snapshot 4) Inform database snapshot is fine

Between #3 and the file system snapshot, you should have a perfect and quick representation of the database at that point in time (when the database was informed it should stop logging).

File system snapshots are a system that have the analogous desired properties for files; but filesystem snapshots are actually quite heavyweight, because they deal with dirents, inodes, extents, etc. CoW filesystem snapshots are designed for ops-task-granularity usage, e.g. daily backups; not for per-transaction historical archiving. CoW filesystems tend to fall over once you get to 100K snapshots. (I tested!) A database that took a snapshot after every CQRS/ES transaction, could be expected to potentially have billions of snapshots.

A system that did its snapshots “inline” to itself, by e.g. managing a pool of pages with a free-list the way LMDB does — but where Txs ultimately add a new version to the root bucket as a snapshot, rather than replacing the root bucket page with themselves — would get a lot closer to allowing one to have at least tens-of-millions of snapshots online. At that point, to achieve a billion snapshots online, you “only” need to shard your timeline across a cluster of 100 nodes.

This is precisely one of the experiments I’m trying. :)

Have you looked at LSM k-v stores (RocksDB being the obvious one)?

Under the hood they are based on building immutable layers of data that are implicitly merged. Clones that share data are cheap (check out rocksdb-cloud for a rocksdb fork which adds this copy-on-write-esque snapshotting). When overwriting a key, the old value will get lazily garbage-collected, but there are ways around that.

I haven't explored it for this usecase, but seems like it might work...

LSM trees are fast for writes, but not fast for reads. In fact, there’s essentially no difference in performance between an LSM tree, and a Merkle-tree whose keyspace is encoded within a B-tree.
If I am understanding you properly, couldn't you do this with SQL by specifying the range of data that represents the timeseries you care about, selected via materialized view?

If you used a stored procedure to compute the range that becomes the view, then all you need to store are the parameters to feed to the stored procedure again, which data you could itself store in a separate table.

As I said in a sibling comment — every version needs to be “hot” / “online” at the same time. The point of this system is to allow for random access to OLAP queries for arbitrary historical versions of the system; and, in fact, to even do time-series reports that perform a given analysis against every available version of the data, hopefully with some degree of parallelism. In matview terms, that means that every version of the data needs to be concurrently materialized.

Given just 100M keys (let’s call it a 20GB exported snapshot size), and 1M versions, that’s an overwhelming amount of data — and 99.9999% of it is redundant copies of the same information, i.e. the stuff that didn’t change between versions.

Solving the problem of the concurrent materializations requiring petabytes of storage for almost-entirely-redundant heap tuples, is essentially solving the problem of creating a tuple-deduplicating DBMS storage engine — which is equivalent to the problem of building a versioned embedded database :)

So use views instead of materialized view. How much time/effort could be saved by simply making better queries? Are you sure that you are looking for or controlling for the right problems?
The goal is to host an infrastructure to accelerate arbitrary user queries, ala a Business Intelligence data-warehouse backend. We don't get to specify what queries the users are doing. It's the classical problem that SQL RDBMSes were introduced to solve: having the data, and having to shape it in advance of knowledge of reporting workload.
Well, it all depends on what operations you need. exact key lookup/write? -> Easy, just use any KV store and append the version to the key, then do a ceil/floor lookup in the KV with the key+target version.

Supporting efficient range scans is hard though.

EDIT: yeah, OLAP will be hard.