Hacker News new | ask | show | jobs
by macintux 3408 days ago
How does Aurora know that the primary is dead? Automatic failover is problematic in a distributed system.
2 comments

AWS uses heartbeats for detecting liveliness. If x heartbeats fail the failover procedure is started. Generally 10s - 5minutes. In practice (for me) the failover has been less than 15s.
My concern was more around split brain. If you fail over while the write master is simply unreachable, pain results.
Aurora's read replicas share the underlying storage that the primary uses, so AWS claims that there's no data loss on failover. They also claim -- and I've never heard anyone say they were wrong -- that Aurora failovers take less than a minute. So the pain should be limited to under a minute of lost writes, which most applications can handle (with an error). It can still be painful depending on the application.

See here for more info: https://aws.amazon.com/rds/aurora/faqs/#high-availability-an...

Yeah, the latency on that failover isn't specified.
Do you mean the amount of time it takes to initiate a failover or the amount of time for a failover to complete?

For the former, I don't think they specify beyond "automatic".

For the latter, "service is typically restored in less than 120 seconds, and often less than 60 seconds": http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora...

That's a pretty good cutover, but as you say, they should also include the time needed to detect a failure and initiate the transition.