by technion 3988 days ago
I've seen multiple deployments (not necessarily specific to PostgreSQL) engineer themselves into a corner with what people feel will be a highly available roll-your-own solution, complete with convincing-sounding blog posts.

In every case, at some point, implementation or software bugs ultimately caused more unplanned outages than I've ever seen a single, well-run server experience.


Based on your experience, is there a common bug or scenario that you see overlooked often? For example, what happens during the transition between leaders, or how multiple failures (such as multiple netsplits) are handled?
I can't really identify a common problem. Things I've seen include:

* After a complete, planned shutdown, neither server is happy to start until it sees the other one online. In the end, neither ends up booting.
* A failover occurs, at which point you find out the hard way that there is state being stored in a non-replicated file. I've seen this with several different Asterisk HA solutions in particular.
* A failover occurs, and non-database-aware storage snapshots leave the redundant server with a non-mountable mirror of the database.
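
To make the first case concrete, here is a rough toy model of the cold-start deadlock. It is only a sketch and not taken from any real HA product: `Node`, `boot_strict` and `boot_with_tiebreak` are hypothetical names, and the "lowest name boots alone" rule is just one assumed way out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    online: bool = False


def boot_strict(nodes):
    """Each node starts only if at least one peer is already online."""
    progress = True
    while progress:
        progress = False
        for node in nodes:
            peer_online = any(p.online for p in nodes if p is not node)
            if not node.online and peer_online:
                node.online = True
                progress = True
    return [n.name for n in nodes if n.online]


def boot_with_tiebreak(nodes):
    """Same rule, but if nobody came up, the lowest-named node boots alone."""
    started = boot_strict(nodes)
    if not started:
        # Assumed tie-break: a deterministic choice, so exactly one node goes first.
        min(nodes, key=lambda n: n.name).online = True
        started = boot_strict(nodes)
    return started


print(boot_strict([Node("a"), Node("b")]))         # []
print(boot_with_tiebreak([Node("a"), Node("b")]))  # ['a', 'b']
```

With the strict rule, a full planned shutdown leaves both nodes waiting on each other forever. The tie-break gets things moving in this toy model, but in a real cluster letting one node boot alone is only safe if something (fencing, a quorum device, a witness) rules out the other node doing the same thing on the far side of a partition.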