| > This assertion is likely wrong We can clearly see with iostat/iotop that the server and its storage isn't overworked. > Is this PostgreSQL managed or self-hosted? We're evaluating on our own bare metal. It's an 8c/16t Zen 3 with 32 GB of RAM. Storage (where everything incl. the WAL is) is NVMe and the drives' true sustainable/synchronous write speed is about ~750 MiB/second. The specs are far, far higher than required for something this basic. Total memory usage when we begin observing the problem is a fraction of what's available. The MySQL production and test environments running this without hindrances actually have only half the amount of RAM. > Are you using COPY or multi-row INSERT statements? Multi-row INSERTs, one per transaction, with anything from 100 to 500 rows. Evaluation simulates the volumes we can see through our APIs on production, though it omits everything but Postgres itself in order to test Postgres' ingestion capability without other factors getting in the way - it's just Postgres and a light-weight data generator mimicking production data. > What is the fillfactor, max_wal_size and checkpoint_timeout? Not sure about the fill factor. Everything is running on default Postgres 17.9 as packaged for Debian 13, so that would be 1GB max_wal_size and checkpoint_timeout is commented out with a default of 5 minutes. Haven't gotten to any performance tuning yet, would be thankful for any suggestions to try out. The only thing we've tried is disabling auto-vacuuming to see if it was too frequent, causing i/o contention or otherwise hogging throughput. Not really any noticable change. > Are random UUIDs (part of) the index? No, neither v4 UUIDs nor b-tree-friendlier v7 UUIDs, but a couple of the indexed columns contain random integral numbers that can become quite sprawly inside a b-tree. We do observe somewhat better results indexing these with HASH indices instead, which also makes a lot more sense for that particular data and how we query it. For evaluation our outset was to stick with b-tree indices because that's what's used on the MySQL setup (InnoDB does not support on-disk hash indices). > What's the iostat or wa during the slowdown? When we reach the point where ingestion speed has shrunk to about a third we observe iowait peaks having grown to some 15%, which tells us the problem is likely Postgres spending more and more of its time shoveling in the indices rather than storing actual row data. Maximum written data at about 150 MB/second is just a fraction of what the NVMe drives can sustain. None of Postgres' individual processes ever top out anywhere close to 100% of a single core on the machine. Total memory usage is less than 2 GB, and here we suspect we have a lot of tuning to look into. To contrast, the MySQL setup is greedier with both CPU and memory usage, and an educated guess is that the major difference allowing it to keep ingesting hundreds of millions of rows without slowing down is that InnoDB, without us having to jump through any hoops, on one hand defers persisting new index data so that the DML can be finalized as quickly as possible, and on the other hand operates on its indices in a much more efficient way than Postgres. Everything needed to reproduce the problem is in this paste, which contains a neutral version of the full table and indices, and a simplified version of the data generator: https://paste.debian.net/plainh/ddc819cb |
I strongly encourage you to post to https://dba.stackexchange.com/, as a HN thread is the wrong place for this discussion (there's a lot of tuning ahead of you and others who are in your situation in the future might skip this nested thread) but be forewarned, the TLDR from dba.stackexchange will either be a quick "you need atleast 64 GB RAM for your PostgreSQL usecase" or there will a better, thorough discussion of increasing max_wal_size and lowering the fillfactor (which is what you're looking for). The ideal answer should even walk you through BRIN (vs btree) indexes. I'm asking you to post there because it will enumerate the WHY of all of these. At that point you can make an informed decision if this all would be worth it.
Now I empathize that MySQL is doing a phenomenal job at only a quarter of that (16 GB you said?) but not for the reasons you might think (and one can certainly argue, nor should an end user care!). MySQL's method of buffering (InnoDB change buffers) and its clustered index gives you the performance you like when suddenly doing bursty writes aka "write sprint". I need you to be aware of that (and dba.stackexchange responses will certainly address that).
I would have written a lengthy post on what to do next but I must first ask:
- Why are you evaluating PostgreSQL in this case when MySQL seems to work well? For example: does it feel like your aggregations are getting slower? As you can see, with PostgreSQL, you will have different set of tradeoffs (RAM, tuning, VACUUM)
- Are there real, limiting business constraints that force you to operate on less than 64 GB of RAM given your volume and throughput expectations (like FF limitations, or these are smaller machines on the edge, etc)
- If you can, as an experiment, while you write your dba.stackexchange question and for the PostgreSQL mailing list, you can tweak multiple parameters and tell me what you see:
- I'm concerned you have a `shared_buffers = 128MB`. Set it to `shared_buffers = 8GB` (give the B-trees room to live in RAM)
- Increase `max_wal_size = 16GB` (stop the checkpoint flooding and let Postgres "breathe" during your batch inserts).
- Increase `checkpoint_timeout = 30min` (set 30min to the actual window it takes; also, this is temporary but this should push checkpoints out so they don't interrupt your "write sprint").
- Set `maintenance_work_mem = 2GB` (should speed up index creation and vacuuming).
- Lower `fillfactor` on those specific 8 indexes from the default 100 down to 70. The B-trees should now have the ability to absorb those inserts better.
This should get you in the right direction, googling for the right documentation, but there's even more ahead of you, including a separate discussion about BRIN (vs btree) indexes.
> defers persisting new index data so that the DML can be finalized as quickly as possible
"defers persisting" might get misread as if it doesn't write to disk - it does but you're close and you will uncover more :)
MySQL defaults are specifically tuned for your "write sprint" usecase, infact, to actually mask the IO latency of secondary index updates but if your real usecase is not just large sudden bursts of writes to a table that has a btree index, you certainly will appreciate this effort. Happy weekend!