Hacker News new | ask | show | jobs
by mickeyp 969 days ago
Using audit tables is another approach that avoids rolling up the data periodically, and you avoid all the needless complexity this approach can introduce; particularly if you're not 100% sure you need it.

You gain temporal querying with a history/audit table that mirrors the main one (or not -- if you prefer not to, and instead want to cram everything into a jsonb column).

Combine it with TSTZRANGE and you can better express the bounds of when the data was last amended and valid. A "period" column has a range `[start, null)` indicating it's current; `[start, end)` indicating an older record. Your main table is always current, with the history table recording every change made. The benefit of this approach is that you can use a GiST index and postgres' datetime ranges to find rows that intersect a point or range in time. If it gets too big, then think of ways you can roll up or prune old records, as needed.

And you can have all of this without compromising on your table structure. Using an ORM and you can have it mirror your audit tables as needed, or of course you can use any number of other methods, such as, ugh, table inheritance or a bit of plpgsql + create table as magic.

Audit tables are useful, and they can approximate a lot of this event system stuff but without the downsides.

4 comments

No doubt audit tables are a popular alternative to event sourcing. But if the current state and changes history of the entity are stored in different tables, someone may say: "Prove me that your audit log is correct". Because you are not using audit table for the business logic, you may not immediately notice the problem with it that corrupts the audit log. Event Sourcing provides other advantages, not only audit log. For example, a service command typically needs to create/update/delete aggregates in the DB (JDBC/R2DBC) and send messages to a Kafka. Without using the two-phase commit (2PC), sending a message in the middle of a transaction is not reliable. There is no guarantee that the transaction will commit. With Event Sourcing you have to subscribe to the event and send the message to Kafka from listener. The delivery guarantee is "at least once". Anyway, there is a demand for Event Sourcing on the market
Regarding publishes to a message broker, the transactional outbox pattern (mentioned in TFA , and something that can be used on its own) provides similar capabilities if you don't want to fully buy into event sourcing.

https://microservices.io/patterns/data/transactional-outbox....

I was thinking about this yesterday and your tip about TSTZRANGE is a good one. My main concern with audit table is that now every write becomes a transaction (if you are concerned with consistency between you master table and your audit table) and you have to decide on what content goes in your master table and what goes in your audit table. When I tried to sketch this out for my use case it turns out to be non-trivial.

I was thinking about this specifically for a jobs table where the jobs go through states (e.g. pending, in-progress, completed, failed). Having an audit of each of those states alongside relevant details might be useful. However, once I start hammering out the details I find myself reaching for pl/pgsql and worrying about putting too much business logic into my db. It starts to feel like I'm building a state machine on top of SQL.

I actually think this dual use-case is something we can really improve upon. In some fever-dream what I think I want is some combination of Kafka (append-only event stream) and Postgres (durable/reliable snapshot of current state).

> I actually think this dual use-case is something we can really improve upon. In some fever-dream what I think I want is some combination of Kafka (append-only event stream) and Postgres (durable/reliable snapshot of current state).

Debezium will capture SQL changes to Kafka for you.

Kafka is amazing for what it is made for, but it doesn't seem to solve the query problem for past states. For example, if there is some intermediary details related to the "in-progress" state that will get overwritten once the job transitions into the "complete" state (or "error" state) then that is non-trivial to query from Kafka.

Even in you decide to keep those intermediary states in the main table then there are other niggles, like retries. If a job gets picked up and fails then I might write to an `error_details` column in the main table. However, if I have retries and the job fails a couple of times then only the latest error details are in the main table. If I want to reconstruct the history of the job I have to somehow retrieve each error event for that job from my append only log. And now I'm querying across systems and combining the data in the application tier.

I'm not saying these aren't solvable problems or that there doesn't exist tools already that can achieve what I'm talking about. Engineers love to say "why don't you just ..." for almost any conceivable problem. What I mean to say is that we seem to be separating things into different systems (append only logs vs. rdbms) which feel like they might be more tightly related. rdbms are like one half and append only logs are the other half. Maybe one day those halves will be combined.

Yes, Debezium is an implementation of the Transaction log tailing pattern an alternative to Transactional outbox pattern.
> You gain temporal querying with a history/audit table that mirrors the main one (or not -- if you prefer not to, and instead want to cram everything into a jsonb column).

I rather liked that MariaDB has temporal tables out of the box: https://mariadb.com/kb/en/temporal-tables/

I think for PostgreSQL there was an extension last I checked: https://pgxn.org/dist/temporal_tables/ (edit: this doesn't appear very popular)

Pretty nice, when the database does the heavy lifting and you have to do a bit less work yourself.

I love temporal tables. I built an implementation in Postgres [1] in a few hundred lines of PLPGSQL. We've run the implementation for several years across a few hundred tables for an ERP product.

The core ideas of the implementation:

- The current table is a typical Postgres table. The past table's primary key is (curr_tbl.pk, asr), where the asr column is the valid period of the row using a domain on tstzrange.

- Create statement-based triggers on the current table to copy the after-image of modified rows to the past table. Statement-based triggers amortize function call overhead compared to row-base triggers (I think depesz compared the performance and found about a 10x difference, but I can't find the link).

- Copying the row after-image on update and delete to the past table is advantageous because inserting a new row has no overhead. The disadvantage is that it's harder to figure out when a row was deleted (requires checking for gaps in the past table).

Some pointers if you dive into uni-temporal tables in Postgres:

- Separate the past table and current table into different schemas. Most database users shouldn't modify the past tables. It's easier to grant permission by schema and it makes autocomplete nicer. By convention, we use erp.invoice and erppast.invoice.

- Use a domain type instead of tstzrange to blunt the sharp edges of tstzrange. See the temporal.period domain type in [1].

- Resist the temptation to query against the past table. Temporal joins are fabulously complex without first-class database support (like Oracle's AS OF).

- Optimize for the typical case: querying the current table. Our first temporal implementation used table inheritance consisting of three tables: a parent table, the current table, and a past table. Theoretically, the parent table lets you query data transparently across the parent and child tables. In practice, we didn't use the parent query capability at all. Having a parent table made code review harder by requiring the reviewer to check that the code queried the current table and not the parent table. It's easy enough and rare enough to query all versions of a row by using:

    SELECT * FROM current_table
    UNION ALL
    SELECT * FROM past_table
- Track the current and past tables in a metadata table. Use tests to check that columns don't drift between the current and past tables. See misaligned_cols.sql in [1] for an example.

- Unfortunately, you can't use declarative partitioning because the primary key of the past table differs from the current table. The past table must include the valid period column to differentiate it from other versions of the same row.

[1]: https://gist.github.com/jschaf/cd1c1a3c2a5897282929ee5e16f94...

How do you manage schema changes?
Table renames don’t break anything.

Changing columns must be done to both tables in a transaction. Unit tests verify that the table columns match.