| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by macintux 3408 days ago
	How does Aurora know that the primary is dead? Automatic failover is problematic in a distributed system.

2 comments

CaveTech 3408 days ago

AWS uses heartbeats for detecting liveliness. If x heartbeats fail the failover procedure is started. Generally 10s - 5minutes. In practice (for me) the failover has been less than 15s.

link

macintux 3408 days ago

My concern was more around split brain. If you fail over while the write master is simply unreachable, pain results.

link

edaemon 3408 days ago

Aurora's read replicas share the underlying storage that the primary uses, so AWS claims that there's no data loss on failover. They also claim -- and I've never heard anyone say they were wrong -- that Aurora failovers take less than a minute. So the pain should be limited to under a minute of lost writes, which most applications can handle (with an error). It can still be painful depending on the application.

See here for more info: https://aws.amazon.com/rds/aurora/faqs/#high-availability-an...

link

xapata 3408 days ago

Yeah, the latency on that failover isn't specified.

link

edaemon 3408 days ago

Do you mean the amount of time it takes to initiate a failover or the amount of time for a failover to complete?

For the former, I don't think they specify beyond "automatic".

For the latter, "service is typically restored in less than 120 seconds, and often less than 60 seconds": http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora...

link

xapata 3408 days ago

That's a pretty good cutover, but as you say, they should also include the time needed to detect a failure and initiate the transition.

link