Can anyone explain why I would want to use this over an orchestration tool that lives outside the DB? Read through the Readme and some of the examples, I still don't get it.
Snapshot PITR of your database means everything restores including the durable jobs at the PIT.
Don't need to synchronize the backups with anything else that is part of the same data store, good for ETL pipelines and other state machine type jobs.
If your ETL is mostly SQL anyway, then having the actual job being run on the same server helps as well.
Yes, but that doesn't have to imply that the compute part of the durable jobs framework also needs to be part of the database snapshot. You almost certainly want that defined in code anyway, if only to have a sane versioning story. So then by having it also be part of the snapshot, you've now got the problem that there are apparently two sources of truth for that bit of the code.
Contributor here. At Microsoft, our Postgres customers seem to split pretty evenly into 2 camps, those that want to do as much as they can in the database, and those that agree with your take - want to keep apps and compute outside the DB.
I bet this is correlated with how much they like/know Postgres already. When people don’t understand their database’s features, they want it to behave like something else they do understand (code). They’re leaving a lot of performance on the table by not leveraging everything their database can do.
Understanding and deciding to rely upon are not the same. Complexity is one of the biggest challenges in real world software systems. Keeping a subsystem’s responsibilities simple can do a lot for a system’s reliability. Most teams have extremely mature systems for managing complexity in their code - tests, CICD etc. Sure you CAN build all those things for your database too, but it’s more work. Most teams I’ve worked on choose to minimize database migrations because it’s a lot of work to make that part of the system as robust against mistakes as code is. Choosing to ignore a feature in your database is often a very rational pragmatic choice aimed at keeping complexity under control.
Yup. Anymore, “we’ll handle that in code” reads to me as “I don’t understand my tools, and don’t want to learn.” Or hubris. The sheer number of times I’ve seen data integrity errors because someone didn’t think they needed a database-level constraint is too damn high.
The other one (also related) is normalization. They’ll have hundreds of millions of rows of duplicated, low-cardinality strings, because “joins are expensive,” while somehow missing the fact that the increased I/O from reduced page-packing also has a performance cost.
Yeah but the increased I/O is cheaper. It's easier to add another webserver as opposed to upgrading your db server.
And I don't think it's as simple as you make it. So where I work we use amongst other things Rails. There are places in our codebase where using joins just isn't feasable cause the database would use too much memory and let's just say N is large.
But since we use Rails we can have it query the tables apart and join the tables through the defined model. We literally save like 20 seconds in some cases because 1 HUGE query becomes 8 straightforward ones with maximum index usage.
And because we have this capability in Rails we would never use something like this, cause that would neccesate us holding two mental models and have a clear "what do we run where" directive which honestly is a PTA.
That sounds like you’re either joining on an unindexed column, or have outdated statistics. 8 queries implies 8 tables, which is well under the limit for both Postgres and MySQL where the planner may give up and choose a suboptimal plan. You may have something like “SELECT * FROM foo JOIN bar ON foo.id = bar.foo_id WHERE bar.baz = 'qux'”, and if there’s no composite index on (bar.foo_id, bar.baz), the planner will choose whichever column it thinks is more selective; it then has to go get the value for the other one, and that can be quite expensive at scale. Even if you have a separate index on each of those, there’s no guarantee the planner will decide to merge them.
> Yeah but the increased I/O is cheaper. It's easier to add another webserver as opposed to upgrading your db server.
I’m referring to I/O on the DB. Rows are stored in pages that are generally 8 KiB (Postgres and MS SQL Server default) or 16 KiB (InnoDB default). If you can fit 200 rows per page, a given query will probably have to fetch fewer pages than if you can only fit 100 rows per page.
> 8 queries implies 8 tables, which is well under the limit for both Postgres and MySQL where the planner may give up and choose a suboptimal plan.
I'm not sure thats true for Postgres. optimal join ordering is np-hard, and finding an optimal join requires exhaustive search through n! combinations (n=number of joins) - thats why postgres generally uses heuristics to figure out join order. 8 is also the default value of "join_collapse_limit" setting in postgres, so it can't ever reliably optimize over 8 joins at a time. Additionally, postgres starts using "genetic algorithms" aka testing random combinations of joins with 12 joins by default (geqo_threshold setting).
I generally agree its better to use database to its fullest, but I would say 8 joins is probably the "limit". Internally at work I've advised teams to try to avoid anything more than 6 joins for "hot-path" queries.
> literally save like 20 seconds in some cases because 1 HUGE query becomes 8 straightforward ones with maximum index usage.
I don’t understand how splitting a query up would have any relationship to index utilization; the planner should trivially pick up on it?
Also are you sure you’re not solving a different problem[0]? Doing joins manually being faster doesn’t smell right, except in the case of data duplication increasing total resultset size substantially
Like the cost of increased network load from not filtering through the join should outweigh anything else in the equation
You can have well-integrated applicative workflows (eg: progress report on a permalink in your front end app), app-restart-proof resumable workflows, and it avoids adding an extra piece of infrastructure.
We use Postgres for that on https://transport.data.gouv.fr (Elixir app which does a fair bit of processing), and it helps.
Not familiar yet with pg_durable though, but I have used or implemented similar solutions and can relate.
Don't need to synchronize the backups with anything else that is part of the same data store, good for ETL pipelines and other state machine type jobs.
If your ETL is mostly SQL anyway, then having the actual job being run on the same server helps as well.