Hacker News new | ask | show | jobs
by teej 2075 days ago
This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turns out, having slightly less IO than needed led the auto-VACUUM process to get further and further behind each time it ran.

My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.

2 comments

Such workflows with scheduled DELETE/INSERT often mean that the data is "derived", there's unlogged tables feature for that in PostgreSQL. Table configured with unlogged are not being written to WAL and thus generate much much less I/O. The downside is that after a crash occurs the table might be empty (before it is repopulated by the DELETE/INSERT workflow again).
It should be an atomic replace - create a new table, do your inserts, swap the new table in. This way you aren’t left without the data during the rebuild, and you are guaranteed a consistent state since DDL is transactional in PostgreSQL.

Make the new table unlogged while it’s being populated as well for that performance improvement if you want, since the atomic replace guarantees your consistency anyway.

Alternative if it’s not a full rebuild, move from a delete/insert to using INSERT .. ON CONFLICT DO UPDATE where possible. I do this for cache tables that store materialized summary views that are updated by pgAgent jobs.

If you are deleting “old” data out then maybe use partitioned tables so you can just DROP the table containing the data you want gone.

Yeah, transactional DDL is a good thing, the potential problem with this solution though — if you have views pointing to the old table they won’t be updated with the new table and you will have to deal with them as well.
UNLOGGED tables is new to me, thanks for sharing.
It's tremendously useful for ETL.
That's super interesting. Can you share some more information on why it's so useful for ETL?
Think of it as a temporary table that doesn't automatically disappear at the end of the session. If nothing but your ETL process cares about the data until it's done being processed, and you still have the original data until the processing is done, in the worst case scenario (the server crashing) you just have to start your ETL process from the start.
Why not use partitioned tables? Simply drop/archive the child tables that you don't need, and avoid the vacuum overhead.
That’s a different thing. In some cases the whole table needs to be replaced. Then inserts into UNLOGGED tables are still much faster than into regular tables (even if there wasn’t a DELETE before).
well there is truncate for that. (not mvcc safe)
The table needed to never be visible as empty, so truncate wasn't a short term option.
I solved this by using partitions (at the time it was done through inheritance, where you could make one table made out of several smaller ones), and just dropping a table with unneeded data.