	by edaemon 3452 days ago
	Yes, Aurora has a single write master, though it does have automatic write failover -- i.e. if the Aurora primary dies, one of your read replicas is promoted to the primary and reads/writes are directed to the new instance. That does constrain your primary's capabilities to the largest instance size (currently a db.r3.8xlarge). I don't have a good idea what the upper limit is for an Aurora database setup.

macintux 3452 days ago

How does Aurora know that the primary is dead? Automatic failover is problematic in a distributed system.

CaveTech 3452 days ago

AWS uses heartbeats to detect liveness. If x heartbeats fail, the failover procedure is started. This generally takes 10 seconds to 5 minutes. In practice (for me), the failover has taken less than 15s.
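
As a rough illustration of the "x failed heartbeats" rule, here is a minimal sketch in Python. It is not AWS's implementation: the class name `HeartbeatMonitor`, the `max_missed` threshold, and the choice to count only consecutive misses are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical failure detector: trips after max_missed consecutive misses."""

    max_missed: int = 3  # assumed threshold, standing in for "x" above
    missed: int = 0      # consecutive failed heartbeats seen so far

    def record(self, ok: bool) -> bool:
        """Record one heartbeat result; return True when failover should start."""
        # A successful heartbeat resets the count; a failure increments it.
        self.missed = 0 if ok else self.missed + 1
        return self.missed >= self.max_missed


monitor = HeartbeatMonitor(max_missed=3)
results = [monitor.record(ok) for ok in (False, False, True, False, False, False)]
# Only the last call reports True: the success in the middle reset the counter.
```

A counter like this only knows that heartbeats stopped arriving. It cannot tell a dead primary from one that is merely unreachable.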

macintux 3452 days ago

My concern was more around split brain. If you fail over while the write master is simply unreachable, pain results.

edaemon 3452 days ago

Aurora's read replicas share the underlying storage that the primary uses, so AWS claims that there's no data loss on failover. They also claim -- and I've never heard anyone say they were wrong -- that Aurora failovers take less than a minute. So the pain should be limited to under a minute of lost writes, which most applications can handle (with an error). It can still be painful depending on the application.

See here for more info: https://aws.amazon.com/rds/aurora/faqs/#high-availability-an...
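
For what "handle (with an error)" might look like on the application side, here is a hedged sketch of retrying a write with exponential backoff while a failover completes. `TransientWriteError`, `write_with_retry`, and the default attempt count and delay are hypothetical names and values for illustration, not part of any AWS or database driver API.

```python
import time


class TransientWriteError(Exception):
    """Hypothetical error raised while the primary is unavailable."""


def write_with_retry(write, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call write(); on a transient error, back off and try again.

    Delays double each time (base_delay, 2x, 4x, ...). The final failure
    is re-raised so the caller can surface an error to the user.
    """
    for attempt in range(attempts):
        try:
            return write()
        except TransientWriteError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the assumed defaults, the waits between the five attempts are 1, 2, 4, and 8 seconds, 15 seconds in total before the last attempt. Retrying is only safe if the write is idempotent, since the first attempt may have been applied before the error was reported.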

xapata 3452 days ago

Yeah, the latency on that failover isn't specified.

edaemon 3452 days ago

Do you mean the amount of time it takes to initiate a failover or the amount of time for a failover to complete?

For the former, I don't think they specify beyond "automatic".

For the latter, "service is typically restored in less than 120 seconds, and often less than 60 seconds": http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora...

xapata 3452 days ago

That's a pretty good cutover, but as you say, they should also include the time needed to detect a failure and initiate the transition.