Hacker News
by edaemon 3405 days ago
Yes, Aurora has a single write master, though it does have automatic write failover -- i.e. if the Aurora primary dies, one of your read replicas is promoted to the primary and reads/writes are directed to the new instance. That does constrain your primary's capabilities to the largest instance size (currently a db.r3.8xlarge).

I don't have a good idea what the upper limit is for an Aurora database setup.


How does Aurora know that the primary is dead? Automatic failover is problematic in a distributed system.
AWS uses heartbeats to detect liveness. If x consecutive heartbeats fail, the failover procedure is started. Generally this is somewhere between 10 seconds and 5 minutes. In practice (for me) failover has taken less than 15 seconds.
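To make the "x heartbeats fail" idea concrete, here is a minimal sketch of a missed-heartbeat counter. It is not AWS's implementation: the class name `HeartbeatMonitor`, the threshold `max_missed`, and the helper `worst_case_detection_seconds` are all made up for illustration, and I'm assuming the misses have to be consecutive.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats (hypothetical, not an AWS API)."""

    max_missed: int = 3  # assumed threshold, i.e. the "x" in "x heartbeats fail"
    missed: int = 0

    def record(self, heartbeat_ok: bool) -> bool:
        """Record one heartbeat result; return True once failover should start."""
        if heartbeat_ok:
            self.missed = 0  # any successful heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= self.max_missed


def worst_case_detection_seconds(interval_seconds: float, max_missed: int) -> float:
    """Rough upper bound on detection time: every probe in a row has to fail."""
    return interval_seconds * max_missed
```

With a 5-second probe interval and a threshold of 3, this sketch would take up to about 15 seconds to decide the primary is gone, before the failover itself even starts.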
My concern was more around split brain. If you fail over while the write master is simply unreachable, pain results.
Aurora's read replicas share the underlying storage that the primary uses, so AWS claims that there's no data loss on failover. They also claim -- and I've never heard anyone say they were wrong -- that Aurora failovers take less than a minute. So the pain should be limited to under a minute of lost writes, which most applications can handle (with an error). It can still be painful depending on the application.

See here for more info: https://aws.amazon.com/rds/aurora/faqs/#high-availability-an...
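For applications that would rather retry than surface the error, a rough sketch of riding out the failover window with exponential backoff could look like this. `TransientWriteError` and `write_with_retry` are hypothetical names; substitute whatever exception your database driver actually raises when the primary goes away.

```python
import time


class TransientWriteError(Exception):
    """Hypothetical stand-in for the driver error seen while a failover is in progress."""


def write_with_retry(do_write, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call do_write(), retrying with exponential backoff on transient errors.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return do_write()
        except TransientWriteError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s with the defaults
```

The defaults only wait about 15 seconds in total, so you would need more attempts or a larger base delay to cover a failover that takes a full minute. Retrying is only safe if the write is idempotent or wrapped in a transaction that did not commit.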

Yeah, the latency on that failover isn't specified.
Do you mean the amount of time it takes to initiate a failover or the amount of time for a failover to complete?

For the former, I don't think they specify beyond "automatic".

For the latter, "service is typically restored in less than 120 seconds, and often less than 60 seconds": http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora...

That's a pretty good cutover, but as you say, they should also include the time needed to detect a failure and initiate the transition.