| HN Mirror

by nijave 1348 days ago

Yeah, that's the one we've had a lot of problems with.

> And stronger yet when the database is unusable due to an incident the cpu is maxed out and it doesnt allow any successful connection, nothing is detected

Apparently the Azure storage system that backs this uses some sort of thread pool, and that pool can lock up or become exhausted, leading to I/O starvation. When this happens, connection attempts fail. The failed attempts can then trigger a connection storm, where all the new connections rolling in exhaust the CPU. The telltale indicator is Postgres checkpoints falling behind.
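One common client-side way to avoid feeding a connection storm like this is to retry with capped exponential backoff plus jitter, so failed clients don't all reconnect at once. This is only a minimal sketch and not a fix for the storage issue itself: `connect_with_backoff`, `backoff_delay`, and their parameters are made-up names, and `retry_on` is an assumption you would replace with your driver's connection error class.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a "full jitter" delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def connect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0,
                         retry_on=(OSError,), sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, sleeping a jittered delay between failures.

    `connect` is any zero-argument callable that opens a connection.
    `retry_on` is an assumption: pass your driver's connection exception class here.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return connect()
        except retry_on:
            # Out of attempts: re-raise the last error instead of sleeping again.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

The jitter matters more than the exact base and cap: without it, every client that failed at the same moment retries at the same moment too.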

All the while, the DB I/O metrics look completely fine because the instance isn't hitting an I/O limit; it's hitting thread pool exhaustion in some storage system underneath the instance, outside of Postgres.

You can also get some clues as to whether this is the problem by enabling Performance Insights and checking the Waits tab. If all the top waits are related to I/O activity, that's another dead giveaway the storage system is locked up again. You can just web search the names of the waits to see what causes them. AWS has some nice docs detailing Postgres waits.
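If the portal view isn't available, roughly the same signal can be pulled from `pg_stat_activity` directly. A minimal sketch, assuming Postgres 10 or later (where `wait_event_type` can be `IO`); `WAIT_QUERY` and `io_wait_share` are made-up names, and the sample rows are invented for illustration, not real measurements.

```python
from collections import Counter

# Run this against the affected server with any Postgres client.
WAIT_QUERY = """
SELECT wait_event_type, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
"""


def io_wait_share(rows):
    """Return the fraction of waiting sessions whose wait_event_type is 'IO'.

    `rows` is an iterable of (wait_event_type, wait_event, sessions) tuples,
    i.e. the result set of WAIT_QUERY.
    """
    totals = Counter()
    for wait_event_type, _wait_event, sessions in rows:
        totals[wait_event_type] += sessions
    waiting = sum(totals.values())
    return totals["IO"] / waiting if waiting else 0.0


# Invented sample rows, only to show the shape of the input.
sample_rows = [
    ("IO", "DataFileRead", 8),
    ("Lock", "transactionid", 1),
    ("IO", "WALWrite", 1),
]
print(io_wait_share(sample_rows))
```

A share close to 1.0 while the I/O metrics still look healthy would match the pattern described above, though what counts as "too high" depends on the workload.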

xiwenc 1348 days ago

Thanks for the detailed explanation! We haven't looked into this in that much detail yet, but what you're describing sounds familiar.

Since we have premium support (P1?), we had an internal Azure PostgreSQL engineer look at the issue, and they pushed the problem back to us, blaming our app for not being built correctly. That has been ping-ponging for over a year now.

Finally, I saw this semi-acknowledgment in their health status yesterday.

Do you happen to know a proper solution? Are you waiting for them to fix this issue, or have you moved to a different DB service?

Perhaps Flexible Server is better?

nijave 1346 days ago

We've talked to the Postgres product engineers many times. The proper solution is to run away from Single Server as quickly as possible. Flexible Server or Citus Hyperscale may be good solutions. We're currently using Patroni to manage VM-based clusters (but still have a lot of data on SS).

Personally, I'd look into a 3rd party if you want managed Postgres (assuming you don't have contractual obligations that might complicate 3rd party access). There are vendors like EnterpriseDB, Scalegrid, etc. that provide various solutions (I don't have any recommendations here--Postgres has a list of managed providers by country https://www.postgresql.org/support/professional_hosting/nort...)

llama052 1345 days ago

The hard part for us is figuring out how to migrate away from Single Server when it's used in production. It takes an eternity to migrate data away from the thing. We are looking at ~24 hours just to get data out, and then we need to figure out how to do a live cutover or backfill.

Absolutely agree on a third party. Azure is just a letdown overall.
