| You've given us some idea of the volume of your data but there's no mention of what's ingesting it or how. > during these stress tests the hardware is nowhere close to over-encumbered, and there's consistent headroom on both memory, CPU and disk I/O This assertion is likely wrong - you're likely skipping over some metrics that has clues to what we need to know. Here are some questions to get the discussion moving. - Is this PostgreSQL managed or self-hosted? Your mention of "consistent headroom on both memory, CPU and disk I/O" gives me hope you're self-hosting it but I've heard the same thing in the past from people attempting to use RDS and wondering the same as you are, so no assumptions. - Are you using COPY or multi-row INSERT statements? - How much RAM does that server have? - What is the fillfactor, max_wal_size and checkpoint_timeout? - Is the WAL on NVMe? - What's the iostat or wa during the slowdown? - Are random UUIDs (part of) the index? Have you posted to https://dba.stackexchange.com/ If I were you, I would create a GitHub repo that has scripts that synthesize the data and reproduce the issues you're seeing. |
We can clearly see with iostat/iotop that the server and its storage isn't overworked.
> Is this PostgreSQL managed or self-hosted?
We're evaluating on our own bare metal. It's an 8c/16t Zen 3 with 32 GB of RAM. Storage (where everything incl. the WAL is) is NVMe and the drives' true sustainable/synchronous write speed is about ~750 MiB/second.
The specs are far, far higher than required for something this basic. Total memory usage when we begin observing the problem is a fraction of what's available. The MySQL production and test environments running this without hindrances actually have only half the amount of RAM.
> Are you using COPY or multi-row INSERT statements?
Multi-row INSERTs, one per transaction, with anything from 100 to 500 rows. Evaluation simulates the volumes we can see through our APIs on production, though it omits everything but Postgres itself in order to test Postgres' ingestion capability without other factors getting in the way - it's just Postgres and a light-weight data generator mimicking production data.
> What is the fillfactor, max_wal_size and checkpoint_timeout?
Not sure about the fill factor. Everything is running on default Postgres 17.9 as packaged for Debian 13, so that would be 1GB max_wal_size and checkpoint_timeout is commented out with a default of 5 minutes. Haven't gotten to any performance tuning yet, would be thankful for any suggestions to try out. The only thing we've tried is disabling auto-vacuuming to see if it was too frequent, causing i/o contention or otherwise hogging throughput. Not really any noticable change.
> Are random UUIDs (part of) the index?
No, neither v4 UUIDs nor b-tree-friendlier v7 UUIDs, but a couple of the indexed columns contain random integral numbers that can become quite sprawly inside a b-tree. We do observe somewhat better results indexing these with HASH indices instead, which also makes a lot more sense for that particular data and how we query it. For evaluation our outset was to stick with b-tree indices because that's what's used on the MySQL setup (InnoDB does not support on-disk hash indices).
> What's the iostat or wa during the slowdown?
When we reach the point where ingestion speed has shrunk to about a third we observe iowait peaks having grown to some 15%, which tells us the problem is likely Postgres spending more and more of its time shoveling in the indices rather than storing actual row data. Maximum written data at about 150 MB/second is just a fraction of what the NVMe drives can sustain. None of Postgres' individual processes ever top out anywhere close to 100% of a single core on the machine. Total memory usage is less than 2 GB, and here we suspect we have a lot of tuning to look into. To contrast, the MySQL setup is greedier with both CPU and memory usage, and an educated guess is that the major difference allowing it to keep ingesting hundreds of millions of rows without slowing down is that InnoDB, without us having to jump through any hoops, on one hand defers persisting new index data so that the DML can be finalized as quickly as possible, and on the other hand operates on its indices in a much more efficient way than Postgres.
Everything needed to reproduce the problem is in this paste, which contains a neutral version of the full table and indices, and a simplified version of the data generator: https://paste.debian.net/plainh/ddc819cb