Hacker News new | ask | show | jobs
by mildbyte 2173 days ago
Shameless plug (I'm a co-founder) but this is basically what we've built with Splitgraph[0]: we can add change tracking to tables using PostgreSQL's audit triggers and let the user switch between different versions of the table / query past versions.

[0] https://www.splitgraph.com/product/data-lifecycle/research

2 comments

Change tracking is not a fully bitemporal scheme, though. A bitemporal table tracks _two_ timelines. One is about when facts in the world were true ("valid time" or "application time"), the other is about the history of particular records in the database ("transaction time" or "system time"). Change tracking can only capture the second.
This was extremely confusing to me on first read but, since temporal data is an area of interest to me for improving an existing service I maintain as well as for future endeavors, it prompted me to go do a bunch of reading to understand what was meant by the distinction. If others are reading this and confused, maybe what I learned will help make the distinction clearer.

There’s two cases where valid/applicable time might be meaningful:

1. Future state known at present time (e.g. when you know now that a fact will change at a specific future point in time).

2. Corrections to current state which should be applied to historical state (e.g. when you know your application produced invalid state and you want to produce a valid state both at present and in historical representations).

The first case is used more in the literature I found, but didn’t really make the distinction clearer for me because I have limited domain use for that kind of behavior. The correction case really drove the point home for me, because my use cases would benefit from it considerably.

I hope this helps other readers interested in the topic but struggling to visualize the two timelines and how they could be used.

It is hard to maintain the two timelines in one's mind at once. Most of the time I maintain the dehydrated version that "you need both" and talk myself through the details when I need them.

Typically what happens is that folks incorrectly capture parts of both timelines. For "valid time", folks will have "created at", "deleted at", "updated at" fields etc and assume this covers it. Unfortunately, it doesn't really capture proper spans. If I have a fact that starts being true on Tuesday, then add another one that starts being true on Thursday, can I deduce that the first fact was true from Tuesday until Thursday?

No, I can't logically make that claim. It's possible that the first fact was true on Tuesday only, leaving a gap on Wednesday. Without explicit valid time, I can't find those kinds of holes, or apply constraints against them.

Similarly, folks rely on history tables, audit tables etc to try to capture transaction time changes. These are still not enough, for the same reason. You need the span of time, the interval, to properly assert the time during which the database held a record to be true. When we discover that the Thursday fact was wrong, we need to be able to know over what window we held that to be true. Overwriting the record, or entering an update into a log etc, is not enough to truly reconstruct the history. You must be explicit about the span.

The necessity of bitemporalism was easy for me, because I had been responsible for either creating or maintaining multiple databases in which I could not answer a wide range of questions after the march of time. I learned many of the standard hacks (copy the price of a stocked item into the invoice line item, or it will change beneath you! Use slowly changing dimensions! Have a transactions table which creates a slow-as-hell copy of what the database does internally, but not as well!). When I read Snodgrass's book it all went clunk and I've been a wild-haired proselyte ever since.

That sounds neat. What does the performance of querying past versions look like? For instance, is lookup time linear with the amount of history or do you maintain special temporal indexes?
It varies depending on how the user chooses to structure storage (we're flexible with that) and what mode of querying they use. We have a more in-depth explanation and some benchmarks in an IPython notebook at [1].

We store Splitgraph "image" (schema snapshot) metadata in PostgreSQL itself and each image has a timestamp, so you could create a PG index on that to quickly get to an image valid at a certain time.

Each image consists of tables and each table is a set of possibly overlapping objects, or "chunks". If two chunks have a row with the same PK, the row from the latter will take precedence. Within these constraints, you can store table versions however you want -- e.g. as a big "base" chunk and multiple deltas (least storage, slowest querying) or as a multiple big chunks (faster querying, more storage).

You can query tables in two ways. Firstly, you can perform a "checkout". Like Git, this replays changes to a table in the staging area and turns it into a normal PostgreSQL table with audit triggers. You get same read performance (and can create whatever PG indexes you want to speed it up). Write performance is 2x slower than normal PostgreSQL since every change has to be mirrored by the audit trigger. When you "commit" the table (a Splitgraph commit, not the Postgres commit), we grab those changes and package them into a new chunk. In this case, you have to pay the initial checkout cost.

You can also query tables without checking them out (we call this "layered querying" [2]). We implemented this through a read-only foreign data wrapper, so all PG clients still support it. In layered querying, we find the chunks that the query requires (using bloom filters and other metadata), direct the query to those and assemble the result. The cool thing about this is you don't have to have the whole table history local to your machine: you can store some chunks on S3 and Splitgraph will download them behind the scenes as required, without interrupting the client. Especially for large tables, this can be faster than PostgreSQL itself, since we are backed by a columnar store [3].

[1] https://www.splitgraph.com/docs/getting-started/frequently-a...

[2] https://www.splitgraph.com/docs/large-datasets/layered-query...

[3] https://www.splitgraph.com/docs/concepts/objects

Thanks for the explanations. That is a very intriguing use-case for foreign data wrappers!