Hacker News new | ask | show | jobs
by bremac 4110 days ago
I'm not sure I understand the purpose of the loop in the last example. AFAIK top-level plpgsql statements (including DO blocks run in psql) execute in a single transaction, so it seems like you end up slowly locking the entire table, as the transaction won't commit until the loop completes. (I learned this the hard way by trying to "batch"-update a table with tens of millions of rows in production.)

The normal way to handle batch updates is to perform the loop outside of postgresql, so that each batch is in its own transaction.

1 comments

  AFAIK top-level plpgsql statements (including DO blocks run in psql) execute in a single transaction
This is true according to the docs. Anonymous code blocks are "transient anonymous functions", and functions are executed within a transaction.

  it seems like you end up slowly locking the entire table
The selected rows would be locked for update, delete, and select for updates, but not for regular reads. Perhaps his users table is used primarily for reads, which made this command run with negligible consequences?

http://www.postgresql.org/docs/9.4/static/sql-do.html

http://www.postgresql.org/docs/9.4/static/plpgsql-structure....

Author here. That's a great point bhahn, I just updated my gist to properly handle the case that you just outlined:

https://gist.github.com/aanari/349c7d97ed50c6f69930#file-bat...

By creating a separate function for the locking and updating of rows, we ensure that the `BEGIN/END` transaction is handled per iteration rather than at the very end, so we only lock rows while they are being processed. Since Postgres does not support nested transaction blocks, calling a defined function from within an anonymous function block seemed to be the easiest and clearest path to achieve this.

https://gist.github.com/aanari/349c7d97ed50c6f69930#file-bat...

The anonymous block is implicitly called in a transaction, so all the calls to batch_at_will will be executed in the parent transaction; there's no way around this (except for using db_link but that's pretty smelly imo).

The only way to batch update while only locking rows in the batch is to run the loop outside of postgres like OP suggested.