| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by subhobroto 56 days ago

You've given us some idea of the volume of your data but there's no mention of what's ingesting it or how.

> during these stress tests the hardware is nowhere close to over-encumbered, and there's consistent headroom on both memory, CPU and disk I/O

This assertion is likely wrong - you're likely skipping over some metrics that has clues to what we need to know. Here are some questions to get the discussion moving.

- Is this PostgreSQL managed or self-hosted?

Your mention of "consistent headroom on both memory, CPU and disk I/O" gives me hope you're self-hosting it but I've heard the same thing in the past from people attempting to use RDS and wondering the same as you are, so no assumptions.

- Are you using COPY or multi-row INSERT statements?

- How much RAM does that server have?

- What is the fillfactor, max_wal_size and checkpoint_timeout?

- Is the WAL on NVMe?

- What's the iostat or wa during the slowdown?

- Are random UUIDs (part of) the index?

Have you posted to https://dba.stackexchange.com/

If I were you, I would create a GitHub repo that has scripts that synthesize the data and reproduce the issues you're seeing.

1 comments

daneel_w 55 days ago

> This assertion is likely wrong

We can clearly see with iostat/iotop that the server and its storage isn't overworked.

> Is this PostgreSQL managed or self-hosted?

We're evaluating on our own bare metal. It's an 8c/16t Zen 3 with 32 GB of RAM. Storage (where everything incl. the WAL is) is NVMe and the drives' true sustainable/synchronous write speed is about ~750 MiB/second.

The specs are far, far higher than required for something this basic. Total memory usage when we begin observing the problem is a fraction of what's available. The MySQL production and test environments running this without hindrances actually have only half the amount of RAM.

> Are you using COPY or multi-row INSERT statements?

Multi-row INSERTs, one per transaction, with anything from 100 to 500 rows. Evaluation simulates the volumes we can see through our APIs on production, though it omits everything but Postgres itself in order to test Postgres' ingestion capability without other factors getting in the way - it's just Postgres and a light-weight data generator mimicking production data.

> What is the fillfactor, max_wal_size and checkpoint_timeout?

Not sure about the fill factor. Everything is running on default Postgres 17.9 as packaged for Debian 13, so that would be 1GB max_wal_size and checkpoint_timeout is commented out with a default of 5 minutes. Haven't gotten to any performance tuning yet, would be thankful for any suggestions to try out. The only thing we've tried is disabling auto-vacuuming to see if it was too frequent, causing i/o contention or otherwise hogging throughput. Not really any noticable change.

> Are random UUIDs (part of) the index?

No, neither v4 UUIDs nor b-tree-friendlier v7 UUIDs, but a couple of the indexed columns contain random integral numbers that can become quite sprawly inside a b-tree. We do observe somewhat better results indexing these with HASH indices instead, which also makes a lot more sense for that particular data and how we query it. For evaluation our outset was to stick with b-tree indices because that's what's used on the MySQL setup (InnoDB does not support on-disk hash indices).

> What's the iostat or wa during the slowdown?

When we reach the point where ingestion speed has shrunk to about a third we observe iowait peaks having grown to some 15%, which tells us the problem is likely Postgres spending more and more of its time shoveling in the indices rather than storing actual row data. Maximum written data at about 150 MB/second is just a fraction of what the NVMe drives can sustain. None of Postgres' individual processes ever top out anywhere close to 100% of a single core on the machine. Total memory usage is less than 2 GB, and here we suspect we have a lot of tuning to look into. To contrast, the MySQL setup is greedier with both CPU and memory usage, and an educated guess is that the major difference allowing it to keep ingesting hundreds of millions of rows without slowing down is that InnoDB, without us having to jump through any hoops, on one hand defers persisting new index data so that the DML can be finalized as quickly as possible, and on the other hand operates on its indices in a much more efficient way than Postgres.

Everything needed to reproduce the problem is in this paste, which contains a neutral version of the full table and indices, and a simplified version of the data generator: https://paste.debian.net/plainh/ddc819cb

link

subhobroto 54 days ago

That's a phenomenal script that absolutely belongs on a dba.stackexchange question and even the PostgreSQL mailing list.

I strongly encourage you to post to https://dba.stackexchange.com/, as a HN thread is the wrong place for this discussion (there's a lot of tuning ahead of you and others who are in your situation in the future might skip this nested thread) but be forewarned, the TLDR from dba.stackexchange will either be a quick "you need atleast 64 GB RAM for your PostgreSQL usecase" or there will a better, thorough discussion of increasing max_wal_size and lowering the fillfactor (which is what you're looking for). The ideal answer should even walk you through BRIN (vs btree) indexes. I'm asking you to post there because it will enumerate the WHY of all of these. At that point you can make an informed decision if this all would be worth it.

Now I empathize that MySQL is doing a phenomenal job at only a quarter of that (16 GB you said?) but not for the reasons you might think (and one can certainly argue, nor should an end user care!). MySQL's method of buffering (InnoDB change buffers) and its clustered index gives you the performance you like when suddenly doing bursty writes aka "write sprint". I need you to be aware of that (and dba.stackexchange responses will certainly address that).

I would have written a lengthy post on what to do next but I must first ask:

- Why are you evaluating PostgreSQL in this case when MySQL seems to work well? For example: does it feel like your aggregations are getting slower? As you can see, with PostgreSQL, you will have different set of tradeoffs (RAM, tuning, VACUUM)

- Are there real, limiting business constraints that force you to operate on less than 64 GB of RAM given your volume and throughput expectations (like FF limitations, or these are smaller machines on the edge, etc)

- If you can, as an experiment, while you write your dba.stackexchange question and for the PostgreSQL mailing list, you can tweak multiple parameters and tell me what you see:

- I'm concerned you have a `shared_buffers = 128MB`. Set it to `shared_buffers = 8GB` (give the B-trees room to live in RAM)

- Increase `max_wal_size = 16GB` (stop the checkpoint flooding and let Postgres "breathe" during your batch inserts).

- Increase `checkpoint_timeout = 30min` (set 30min to the actual window it takes; also, this is temporary but this should push checkpoints out so they don't interrupt your "write sprint").

- Set `maintenance_work_mem = 2GB` (should speed up index creation and vacuuming).

- Lower `fillfactor` on those specific 8 indexes from the default 100 down to 70. The B-trees should now have the ability to absorb those inserts better.

This should get you in the right direction, googling for the right documentation, but there's even more ahead of you, including a separate discussion about BRIN (vs btree) indexes.

> defers persisting new index data so that the DML can be finalized as quickly as possible

"defers persisting" might get misread as if it doesn't write to disk - it does but you're close and you will uncover more :)

MySQL defaults are specifically tuned for your "write sprint" usecase, infact, to actually mask the IO latency of secondary index updates but if your real usecase is not just large sudden bursts of writes to a table that has a btree index, you certainly will appreciate this effort. Happy weekend!

link

daneel_w 53 days ago

Thanks a bunch for all of the config tuning advice. Can see huge improvements in both flattening the creeping slow-down and really pushing back the point where we eventually hit the wall. Just lowering the fill factor on the indices made a substantial difference even before permitting Postgres to use more RAM. Using BRIN indices for certain columns, as suggested by several others in the discussion, also resulted in some smaller but still valuable speed-ups. It's obvious by now that a careful combination of curated choices greatly matters for the full scope of this data. On the whole, in fact, it looks like all of the combined advice clear this hurdle up so much that we can no longer say we observe Postgres having an actual problem dealing with our current volumes and intensity of ingestion.

We run quite a few of these setups, so cost absolutely plays a big part in the size and selection of hardware. We're evaluating Postgres because of a mergers and acquisitions event prompting us to consolidate portions of the platforms for the sake of savings and reducing the overall "sprawl". Conversely we're also testing how the other party would fare if some of their stuff were to run under MySQL instead of Postgres. There's a chance we end up leaving things just as they are.

link

subhobroto 53 days ago

> On the whole, in fact, it looks like all of the combined advice clear this hurdle up so much that we can no longer say we observe Postgres having an actual problem dealing with our current volumes and intensity of ingestion.

I am extremely happy for you!

This is fantastic news. I would really appreciate a followup post on DB SE and/or PostgreSQL ML because I assure you, there are others in your position. A deeply nested HN thread is not something that will be visible - you would be saving a lot of people heartburn.

Also remember, we just got started! The tips I gave you earlier is just a drop i the bucket. In the days of AWS RDS, knowledge about these are a "lost art" but running PostgreSQL yourself is extremely empowering. If you could make the posts, please do and I also urge you to test out https://github.com/pgsty/pigsty

link