Hacker News new | ask | show | jobs
by anarazel 2850 days ago
> Why not just consume the actual DB journal?

It's decidedly non-trivial to do so, as usually the journal doesn't contain all the information to do so. Journal writes are often a bottleneck so DB engines try to restrict their contents to just the necessary parts. Typically it'll e.g. not contain information about table schemas etc.

You can do so, see e.g. postgresql's logical decoding, but it's plenty of additional work.

1 comments

> Journal writes are often a bottleneck

The point I was trying to make is that the alternative—writing all your DB activity into the DB—means that you're already, 100% guaranteed, creating more overhead than the heaviest possible journal-safety properties would create. You're writing unabridged journal entries into the DB, where they then get even more journal attached.

There is no world where it makes more sense to write a copy of everything you do into an "everything I did" table, than it does to use an append-only / immutable / log-centric DBMS (where "Postgres with an ETL pipeline pointed at its WAL logs" is a rather hacky immutable DBMS.)

If you know, before you ever start coding, that you will need to know everything that ever happened, writing everything twice—when your DBMS is already writing everything twice—is a rather silly way to choose to go about solving your problems. You're already in a world where you need different guarantees than a traditional (ephemeral) RDBMS gives you; you just haven't realized it yet.

Perhaps it’s a dev vs ops problem. Only ops see the journal, understand it, access it. If the dev want a journal they end up building their own.
It's also a public versus private API issue. Most databases consider their journal a private implementation detail and don't recommend programming directly against it, as implementation can change as the developers change internal details from database software version to version, or even on a runtime whim as the system's query optimizer/write buffer state/etc needs change over time in some of the databases.

For instance, Microsoft SQL Server has a built-in public change tracking API, which you can tell is a leaky abstraction over top of the journal, but SQL Server makes an API guarantee when using that specific change tracking API whereas (at least the last time I looked) it will never make API guarantees about anything lower level than that that accesses its journal. The Change Tracking API documentation even warns you that it locks the journal into certain deoptimizations and that there are clear performance trade-offs, which also hints that using the lower level journal API more directly would likely hit optimization problems and data loss issues.

As a dev, it's rarely a good idea to program against APIs your vendor intentionally deems to be private/implementation details that can change between versions/may be unstable in runtime operation. If the database doesn't offer a public journaling API, then yes, sadly, sometimes the best or at least least-worst answer is to build a sub-par journal inside the database itself, with of course its own tradeoffs in performance.

> For instance, Microsoft SQL Server has a built-in public change tracking API, which you can tell is a leaky abstraction over top of the journal,

Postgres' logical decoding [1] provides something very similar.

[1]: https://www.postgresql.org/docs/current/static/logicaldecodi...

> There is no world where it makes more sense to write a copy of everything you do into an "everything I did" table, than it does to use an append-only / immutable / log-centric DBMS (where "Postgres with an ETL pipeline pointed at its WAL logs" is a rather hacky immutable DBMS.)

There is one: the world where append-only / immutable / log-centric DBMS do not offer the same advantages as mainstream RDBMSs - scalability, stability, performance, language support, open sourcing, active development, ubiquitous SQL knowledge, etc. etc.

The downside of bolting an event-sourcing or changelog system onto an established RDBMS like Postgres are (a) the cost of writing them, and (b) the performance / disk space cost. Both of those may be easier to deal with than, say, Datomic's licensing and Clojure-oriented APIs, especially if you can use an existing library like Marten.

What immutable DBMS would you recommend?

> writing everything twice—when your DBMS is already writing everything twice—is a rather silly way to choose to go about solving your problems

The GP specifically said the DBMS journal does not write everything: it writes as little as it can get away with, for performance reasons. Since you quoted GP very selectively, you didn’t address what they actually said.