by nijave
1348 days ago
Yeah, that's the one we've had a lot of problems with.

> And stronger yet when the database is unusable due to an incident the cpu is maxed out and it doesn't allow any successful connection, nothing is detected

Apparently the Azure storage system that backs this uses some sort of thread pool, and that pool can lock up or become exhausted, leading to I/O starvation. When this happens, connection attempts fail. The failed attempts can then trigger a connection storm, where all the new connections rolling in exhaust the CPU. The telltale indicator is Postgres checkpoints getting behind. All the while, the DB I/O metrics look completely fine, because it's not hitting an I/O limit; it's hitting thread pool exhaustion in the storage system under the instance, outside of Postgres.

You can also get some clues as to whether this is the problem by enabling Performance Insights and checking the Waits tab. If all the top waits are related to I/O activity, that's another dead giveaway that the storage system is locked up again. You can just web search the names of the waits to see what causes them. AWS has some nice docs detailing Postgres waits.
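If you can't get at the Waits tab, a rough equivalent is to sample `pg_stat_activity` yourself and see how much of the waiting is I/O. Below is a minimal sketch, not anything Azure or AWS ships: the helper names `summarize_waits` and `looks_io_starved` are made up for illustration, and the 0.8 cutoff is an arbitrary assumption rather than a documented threshold. It only does the counting; you'd run the query with whatever Postgres client you already use and pass in the rows.

```python
from collections import Counter

# Sampling query: pg_stat_activity has exposed wait_event_type and
# wait_event since PostgreSQL 9.6. Run it with your own client.
WAIT_SAMPLE_SQL = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
"""


def summarize_waits(rows, top_n=5):
    """Summarize a sample of (wait_event_type, wait_event) tuples.

    Returns (io_share, top_events), where io_share is the fraction of
    sampled waits whose type is 'IO' and top_events lists the most
    common (type, event) pairs with their counts.
    """
    counts = Counter((wtype, event) for wtype, event in rows if event is not None)
    total = sum(counts.values())
    io_total = sum(n for (wtype, _), n in counts.items() if wtype == "IO")
    io_share = io_total / total if total else 0.0
    return io_share, counts.most_common(top_n)


def looks_io_starved(rows, threshold=0.8):
    """Hypothetical heuristic: flag the sample if I/O waits dominate.

    The threshold is an illustrative assumption, not a documented value.
    """
    io_share, _ = summarize_waits(rows)
    return io_share >= threshold
```

One sample on its own doesn't tell you much. Taking a sample every few seconds during an incident and watching whether the I/O share stays pinned near 1.0 is closer to what the Waits tab shows.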
Since we have premium support (P1?), we had an internal Azure PostgreSQL engineer look at the issue, and they pushed the problem back to us, blaming our app for not being built correctly. That has been ping-ponging for over a year now.
Finally, I saw this semi-acknowledgment in their health status yesterday.
Do you happen to know a proper solution? Are you waiting for them to fix this issue, or have you moved to a different DB service?
Perhaps the Flexible Server is better?