Hacker News new | ask | show | jobs
by postgressomethi 994 days ago
Polling an updated_at column is not robust in its most simple form, as transactions are not guaranteed to commit in that order.
3 comments

Author here. Good point. For those that are curious, parent is referring to the following situation:

1. Transaction A starts, its before trigger fires, Row 1 has its updated_at timestamp set to 2023-09-22 12:00:01.

2. Transaction B starts a moment later, its before trigger fires, Row 2 has its updated_at timestamp set to 2023-09-22 12:00:02.

3. Transaction B commits successfully.

4. Polling query runs, sees Row 2 as the latest change, and updates its cursor to 2023-09-22 12:00:02.

5. Transaction A then commits successfully.

A simple way to avoid this issue is to not poll close to real-time, as the order is eventually consistent.

Perhaps a more robust suggestion would be to use a sequence? Imagine a new column, `updated_at_idx`, that incremented every time a row was changed.

Sequences kind of have the same issue, because you don't know if a gap is because of a rollback or an uncommitted transaction. Though with some logic you can do a pretty good job at this with sequences. And then you're not in the realm of "simple" anymore, at all.
Any ideas for a simple polling implementation that's more robust?
It's not exactly simple as it involves some postgres specific knowledge, but you can make it reliable when working with transaction ids (see https://event-driven.io/en/ordering_in_postgres_outbox/).
I've had pretty much the exact same problem and what I went for in my low-volume case was to simply add advisory locks such that I can guarantee the transaction start times provide correct ordering.
Set the trigger to add the primary key + change time to a separate table, then scan/truncate that table to poll changes.
Woah, that's news to me. Is that true even if triggers are used to update a column?

  CREATE OR REPLACE FUNCTION
      update_updated_at_function()
  RETURNS TRIGGER AS $$
  BEGIN
      NEW.updated_at = now();
      RETURN NEW;
  END;
  $$ language 'plpgsql';


  CREATE TRIGGER
      update_updated_at_trigger
  BEFORE INSERT OR UPDATE ON
      "my_schema"."my_table"
  FOR EACH ROW EXECUTE PROCEDURE
      update_updated_at_function();
  END $$;

Is it possible for two rows to have `updated_at` timestamps that are different from the transaction commit order even if the above function and trigger are used? It's alright if `updated_at` and the commit timestamp are not the same, but the `updated_at` must represent commit order accurate to the millisecond/microsecond.
To confirm your fear, you can't use the updated_at timestamp as a proxy for commit order. The commits happen in a different order, and can be arbitrarily far apart, like hours or days depending on how long your transactions can last.
now() is the timestamp the transaction began at. There is no function to return the commit timestamp because you have to write the value before you commit.
For polling, instead of updated_at, I use a _txid column that gets set by a trigger to the current transaction ID. Then, when polling, use txid_current() to see which transactions have committed and which haven’t. It’s a little dicey and super easy to hit fencepost errors, but it’s been running smoothly in production for a few years.