| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by teej 2075 days ago

This seems neat!

Last year I helped a friend diagnose an issue they had with Postgres. A database table with a scheduled full DELETE/INSERT had slowed to the point of failure. It turns out, having slightly less IO than needed led the auto-VACUUM process to get further and further behind each time it ran.

My friend simply provisioned more IO and moved on. Another option would be to rewrite the process to naturally produce fewer dead rows. It would be great to have a third feasible option.

2 comments

andreypopp 2075 days ago

Such workflows with scheduled DELETE/INSERT often mean that the data is "derived", there's unlogged tables feature for that in PostgreSQL. Table configured with unlogged are not being written to WAL and thus generate much much less I/O. The downside is that after a crash occurs the table might be empty (before it is repopulated by the DELETE/INSERT workflow again).

link

snuxoll 2075 days ago

It should be an atomic replace - create a new table, do your inserts, swap the new table in. This way you aren’t left without the data during the rebuild, and you are guaranteed a consistent state since DDL is transactional in PostgreSQL.

Make the new table unlogged while it’s being populated as well for that performance improvement if you want, since the atomic replace guarantees your consistency anyway.

Alternative if it’s not a full rebuild, move from a delete/insert to using INSERT .. ON CONFLICT DO UPDATE where possible. I do this for cache tables that store materialized summary views that are updated by pgAgent jobs.

If you are deleting “old” data out then maybe use partitioned tables so you can just DROP the table containing the data you want gone.

link

andreypopp 2075 days ago

Yeah, transactional DDL is a good thing, the potential problem with this solution though — if you have views pointing to the old table they won’t be updated with the new table and you will have to deal with them as well.

link

teej 2075 days ago

UNLOGGED tables is new to me, thanks for sharing.

link

jacques_chester 2075 days ago

It's tremendously useful for ETL.

link

cpard 2075 days ago

That's super interesting. Can you share some more information on why it's so useful for ETL?

link

djur 2075 days ago

Think of it as a temporary table that doesn't automatically disappear at the end of the session. If nothing but your ETL process cares about the data until it's done being processed, and you still have the original data until the processing is done, in the worst case scenario (the server crashing) you just have to start your ETL process from the start.

link

sharadov 2075 days ago

Why not use partitioned tables? Simply drop/archive the child tables that you don't need, and avoid the vacuum overhead.

link

andreypopp 2075 days ago

That’s a different thing. In some cases the whole table needs to be replaced. Then inserts into UNLOGGED tables are still much faster than into regular tables (even if there wasn’t a DELETE before).

link

merb 2075 days ago

well there is truncate for that. (not mvcc safe)

link

teej 2075 days ago

The table needed to never be visible as empty, so truncate wasn't a short term option.

link

takeda 2075 days ago

I solved this by using partitions (at the time it was done through inheritance, where you could make one table made out of several smaller ones), and just dropping a table with unneeded data.

link