Hacker News new | ask | show | jobs
by NikolaeVarius 2534 days ago
In many HA setups, you're not supposed to have to care if any single thing goes down, because it should auto-recover.

The article said that the node stalled in an unforeseen way, which may have caused the standard recovery mechanisms to fail silently.


Right, but they didn't recover quickly. Leaving the cluster in such a state for so long sounds like poor monitoring to me, because this is known to interfere with a later election.
The health check said it was ok. How would they know it needed to be recovered?

The fault was the bad health check. Not the process.
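
To illustrate the point, here is a rough sketch of how a check can say "ok" for a stalled node. It is not the setup from the article: `Node`, `shallow_check`, `progress_check`, and the 30-second threshold are all hypothetical names and values made up for the example. The assumption is that the check only looked at whether the process was alive, not whether it was making progress.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node state; every field here is illustrative."""
    process_alive: bool = True
    # Monotonic timestamp of the last unit of real work the node completed.
    last_progress_at: float = field(default_factory=time.monotonic)

    def record_progress(self, now=None):
        """Call whenever the node completes real work (e.g. applies a write)."""
        self.last_progress_at = time.monotonic() if now is None else now


def shallow_check(node):
    """Liveness only: passes as long as the process exists, even if stalled."""
    return node.process_alive


def progress_check(node, max_stall_seconds=30.0, now=None):
    """Passes only if the process is alive AND has made progress recently."""
    now = time.monotonic() if now is None else now
    stalled_for = now - node.last_progress_at
    return node.process_alive and stalled_for <= max_stall_seconds
```

With a node that last made progress at t=0 and is checked at t=100, `shallow_check` still passes while `progress_check` fails, which is roughly the "healthy but stalled" situation being described. Whether a progress-based check would have caught the stall in the article is a guess.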

They only just clarified that monitoring was in place and they were reporting as healthy. See the comments above.