| HN Mirror

by gtowey 216 days ago

This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process.

Not that I'm discounting the author's experience, but something doesn't quite add up:

- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

- If they know, how is this not an urgent P0 issue for AWS? This seems like the most basic of basic usability features is 100% broken.

- Is there something more nuanced to the failure case here such as does this depend on transactions in-progress? I can see how maybe the failover is waiting for in-flight transactions to close and then maybe hits a timeout where it proceeds with the other part of the failover by accident. That could explain why it doesn't seem like the issue is more widespread.

13 comments

twisteriffic 216 days ago

> How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through"

perching_aix 216 days ago

An experience not exclusive to cloud vendors :) Even better when the vendor throws their hands up cause the issue is not reliably repro'able.

That was when I scripted away a test that ran hundreds of times a day on a lower environment, attempting repro. As they say, at scale, even insignificant issues become significant. I don't remember clearly, I think it was a 5-10% chance that the issue triggered.

At least confirming the fix, which we did eventually receive, was mostly a breeze. Had to provide an inordinate amount of captures, logs, and data to get there though. Was quite the grueling few weeks, especially all the office politics laden calls.
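The repro arithmetic behind that scripted test can be sketched as follows. This is a minimal sketch that assumes every run is independent and triggers the issue with a fixed probability `p`. The 5-10% figure is the commenter's recollection, and `chance_of_repro` is a hypothetical name.

```python
def chance_of_repro(p: float, runs: int) -> float:
    """Probability that at least one of `runs` independent attempts hits the bug."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # Complement of "every single run misses the bug".
    return 1.0 - (1.0 - p) ** runs


if __name__ == "__main__":
    # Assumed per-run trigger rate of 5%, the low end of the recalled range.
    for runs in (10, 50, 100):
        print(runs, round(chance_of_repro(0.05, runs), 3))
```

Under those assumptions, a few hundred runs a day makes at least one daily repro very likely, which is why automating the attempt pays off even for a low trigger rate.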

pixl97 216 days ago

I've had customers with load related bugs for years simply because they'd reboot when the problem happened. When dealing with the F100 it seems there is a rather limited number of people in these organizations that can troubleshoot complex issues, that or they lock them away out of sight.

perching_aix 216 days ago

It is a tough bargain to be fair, and it is seen in other places too. From developers copying out their stuff from their local git repo, recloning from remote, then pasting their stuff back, all the way to phone repair just meaning "here's a new device, we synced all your data across for you", it's fairly hard to argue with the economic factors and the effectiveness of this approach at play.

With all the enterprise solutions being distributed, loosely coupled, self-healing, redundant, and fault-tolerant, issues like this essentially just slot in perfectly. Compound this with man-hours (especially expert ones) being a lot harder to justify for any one particular bump in tail latency, and the equation is just really not there for all this.

What gets us specifically to look into things is either the issue being operationally gnarly (e.g. frequent, impacting, or both), or management being swayed enough by principled thinking (or at least pretending to be). I'd imagine it's the same elsewhere. The latter would mostly happen if fixing a given thing becomes an office political concern, or a corporate reputation one. You might wonder if those individual issues ever snowballed into a big one, but turns out human nature takes care of that just "sufficiently enough" before it would manifest "too severely". [0]

Otherwise, you're looking at fixing / RCA'ing / working around someone else's product defect on their behalf, and giving your engineers a "fun challenge". Fun doesn't pay the bills, and we rarely saw much in return from the vendor in exchange for our research. I'd love to entertain the idea that maybe behind closed doors the negotiations went a little better because of these, but for various reasons, I really doubt so in hindsight.

[0] as delightfully subjective as those get of course

hobs 216 days ago

If I had a nickel for every time I had to explain that rebooting a database server is usually the wrong choice I would have quite a fortune.

sally_glance 216 days ago

Theoretically you're supposed to assign lower prio to issues with known workarounds but then there should also be reporting for product management (which assigns weight by age of first occurrence and total count of similar issues).

Amazon is mature enough for processes to reflect this, so my guess for why something like this could slip through is either too many new feature requests or many more critical issues to resolve.

pwarner 216 days ago

Azure yes, I'd expect this and the restart would take many minutes. Been there done that.

AWS this is surprising

theanomaly 216 days ago

I'm surprised this hasn't come up more often too. When we worked with AWS on this, they confirmed there was nothing unique about our traffic pattern that would trigger this issue. We also didn't run into this race condition in any of our other regions running similar workloads. What's particularly concerning is that this seems to be a fundamental flaw in Aurora's failover mechanism that could theoretically affect anyone doing manual failover.

kobalsky 216 days ago

> - How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?

I know that there is no comparison in the user base, but a few years ago I ran into a massive Python + MySQL bug that:

1. made SELECT ... FOR UPDATE fail silently

2. aborted the transaction and set the connection into autocommit mode

This is basically a worst-case scenario in a transactional system.

I was basically screaming like a mad man in the corner but no one seemed to care.

Someone contacted me months later telling me that they experienced the same problem with "interesting" consequences in their system.

The bug was eventually fixed but at that point I wasn't tracking it anymore, I provided a patch when I created the issue and moved on.

https://stackoverflow.com/questions/945482/why-doesnt-anyone...

sroussey 216 days ago

Converting a connection to autocommit upon error. Yikes!!

evanelias 216 days ago

If I'm reading this correctly, it sounds like the connection was already using autocommit by default? In that situation, if you initiate a transaction, and then it gets rolled back, you're back in autocommit unless/until you initiate another transaction.

If so, that part is all totally normal and expected. It's just that due to a bug in the Python client library (16 years ago), the rollback was happening silently because the error was not surfaced properly by the client library.

yencabulator 214 days ago

Is there any scenario in a sane world where a transaction ceases to be in scope just because it went into an error state? I'd have expected the client to send an explicit ROLLBACK when they realize a transaction is in an error state, not for the server to end it and just notify the client. This is how psql appears to the end user.

  postgres=# begin;
  BEGIN
  postgres=*# bork;
  ERROR:  syntax error at or near "bork"
  LINE 1: bork;
          ^
  postgres=!# select 1;
  ERROR:  current transaction is aborted, commands ignored until end of transaction block
  postgres=!# rollback;
  ROLLBACK
  postgres=# select 1;
   ?column?
  ----------
          1
  (1 row)
  
  postgres=#

evanelias 214 days ago

Every DBMS handles errors slightly differently. In a sane world you shouldn't ever ignore errors from the database. It's unfortunate to hear that a Python MySQL client library had a bug that failed to expose errors properly in one specific situation 16 years ago, but that's not terribly relevant to today.

Postgres behavior with errors isn't even necessarily desirable -- in terms of ergonomics, why should every typo in an interactive session require me to start my transaction over from scratch?

yencabulator 214 days ago

> why should every typo in an interactive session require me to start my transaction over from scratch?

That part would hold even for the MySQL auto-rollback implied above.

o11c 216 days ago

I would argue that it's a bug for it even to be possible to autocommit.

evanelias 216 days ago

What do you mean? Autocommit mode is the default mode in Postgres and MS SQL Server as well. This is by no means a MySQL-specific behavior!

When you're in autocommit mode, BEGIN starts an explicit transaction, but after that transaction (either COMMIT or ROLLBACK), you return to autocommit mode.

The situation being described upthread is a case where a transaction was started, and then rolled back by the server due to deadlock error. So it's totally normal that you're back in autocommit mode after the rollback. Most DBMS handle this identically.

The bug described was entirely in the client library failing to surface the deadlock error. There's simply no autocommit-related bug as it was described.
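The failure mode discussed in this subthread can be illustrated with a toy model. This is a sketch only: `ToyConnection`, `DeadlockError`, and `transfer` are hypothetical names, not any real driver's API. It assumes a connection that is in autocommit mode by default and a server that rolls back a deadlocked transaction on its own.

```python
class DeadlockError(Exception):
    """Stands in for the error a server reports after rolling back a deadlock victim."""


class ToyConnection:
    """Toy model of an autocommit-by-default connection (not a real driver)."""

    def __init__(self):
        self.in_transaction = False
        self.committed = []  # rows that are durable
        self.pending = []    # rows written inside the open transaction

    def begin(self):
        self.in_transaction = True
        self.pending = []

    def write(self, row, deadlock=False):
        if deadlock:
            # Server-side rollback: the transaction ends and an error is reported.
            self.pending = []
            self.in_transaction = False
            raise DeadlockError("deadlock detected; transaction rolled back")
        if self.in_transaction:
            self.pending.append(row)
        else:
            self.committed.append(row)  # autocommit: durable immediately

    def rollback(self):
        self.pending = []
        self.in_transaction = False


def transfer(conn, swallow_errors):
    conn.begin()
    try:
        conn.write("debit", deadlock=True)
    except DeadlockError:
        if not swallow_errors:
            conn.rollback()
            raise
        # Buggy path: the error is hidden, so the caller believes
        # the transaction is still open.
    conn.write("credit")  # runs in autocommit mode on the buggy path
    conn.rollback()       # too late: "credit" is already durable
    return conn.committed
```

With `swallow_errors=True` the later write becomes durable even though the caller rolls back, which matches the "silent" half of the bug described upthread. With `swallow_errors=False` the caller sees the error and nothing is committed.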

o11c 216 days ago

Yes, and most DBMS's are full of historical mistakes.

In a sane world, statements outside `BEGIN` would be an unconditional error.

aetherson 216 days ago

My experience with AWS is that they are extremely, extremely parsimonious about any information they give out. It is near-impossible to get them to give you any details about what is happening beyond the level of their API. So my gut hunch is that they think that there's something very rare about this happening, but they refuse to give the article writer the information that might or might not help them avoid the bug.

everfrustrated 216 days ago

If you pay for the highest level of support you will get extremely good support. But it comes with signing an NDA so you're not going to read about anything coming out of it on a blog.

I've had AWS engineers confirm very detailed and specific technical implementation details many many times. But these were at companies that happily spent over a $1M/year with AWS.

qaq 216 days ago

Nah, if your monthly spend is really significant then you will get good support and issues you care about will get prioritized. Going from a startup with 50K/month spend to a large company with untold millions per month, the experience is night and day. We have dev managers and engineers from key AWS teams present in meetings when need be, we get issues we raise prioritized and added to dev roadmaps, etc.

aetherson 216 days ago

I was at a company that spent over $90M a year with AWS and we got defensive, limited comms.

maherbeg 216 days ago

Yeah I agree, this seems like a pretty critical feature to the Aurora product itself. We saw a similar behavior recently where we had a connection pooler in between which indicates something wrong with how they propagate DNS changes during the failover. wtf aws

CaptainKanuk 216 days ago

Whenever we have to do any type of AWS Aurora or RDS cluster modification in prod we always have the entire emergency response crew standing by right outside the door.

Their docs are not good and things frequently don't behave how you expect them to.

ekropotin 216 days ago

Oh, well, it’s always DNS!

Hovertruck 216 days ago

Agreed, we've been running multiple aurora clusters in production for years now and have not encountered this issue with failovers.

dalyons 216 days ago

Same. There’s something missing here.

belter 216 days ago

The article is low quality. It does not mention which Aurora PostgreSQL version was involved, and it provides no real detail about how the staging environment differed from production, only saying that staging “didn’t reproduce the exact conditions,” which is not actionable.

This AWS documentation section: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQ...

“Amazon Aurora PostgreSQL updates”: under Aurora PostgreSQL 17.5.3, September 16, 2025 – Critical stability enhancements includes a potential match:

“...Fixed a race condition where an old writer instance may not step down after a new writer instance is promoted and continues to write…”

If that is the underlying issue, it would be serious, but without more specifics we can’t draw conclusions.

For context: I do not work for AWS, but I do run several production systems on Aurora PostgreSQL. I will try to reproduce this using the latest versions over the next few hours. If I do not post an update within 24 hours, assume my tests did not surface anything.

That would not rule out a real issue in certain edge cases, configurations, or version combinations but it would at least suggest it is not broadly reproducible.

theanomaly 213 days ago

We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, I think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix primarily focused on data protection rather than eliminating the race condition itself.

Here's what we think is happening: Before the fix (pre-15.12.4):

1. Failover starts

2. Both instances accept and process writes simultaneously

3. Failover eventually completes after the writer steps down

4. Result: Potential data consistency issues ???

After the fix (15.12.4+):

1. Failover starts

2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests

3. Both instances restart/crash

4. Failover fails or requires manual intervention

The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue. This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.

The comment in the release notes about "fixed a race condition where an old writer instance may not step down" is somewhat misleading - it's more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state and that is probably why AWS Support did not point us to this release when we raised the issue.
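The kind of storage-layer safety mechanism hypothesized in this comment is commonly called fencing. A minimal sketch, assuming a monotonically increasing writer epoch, might look like the following. `ToyStorage` and `StaleWriterError` are hypothetical names, and nothing here is claimed to reflect how Aurora's storage layer is actually implemented.

```python
class StaleWriterError(Exception):
    """Raised when a write carries an epoch older than the current writer's."""


class ToyStorage:
    """Toy shared storage that accepts writes only from the latest promoted writer."""

    def __init__(self):
        self.current_epoch = 0
        self.log = []

    def promote(self) -> int:
        """Register a new writer and return the epoch it must attach to writes."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, record: str) -> None:
        if epoch != self.current_epoch:
            # The old writer never stepped down, but it can no longer change data.
            raise StaleWriterError(
                f"epoch {epoch} is stale; current epoch is {self.current_epoch}"
            )
        self.log.append((epoch, record))


if __name__ == "__main__":
    storage = ToyStorage()
    old_writer = storage.promote()
    storage.write(old_writer, "before failover")
    new_writer = storage.promote()  # failover promotes a new writer
    try:
        storage.write(old_writer, "late write from old writer")
    except StaleWriterError as exc:
        print("rejected:", exc)
    storage.write(new_writer, "after failover")
```

In this model the race between demotion and promotion is untouched. Fencing only guarantees that the losing writer's requests fail, which is consistent with the trade of a consistency risk for an availability problem described above.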

nijave 216 days ago

fwiw we haven't seen issues doing manual failovers for maintenance using the same/similar procedure described in the article. I imagine there is something more nuanced here and it's hard to draw too many conclusions without a lot more details being provided by AWS

grogers 215 days ago

It sounds like part of the problem was how the application reacted to the reverted fail over. They had to restart their service to get writes to be accepted, implying some sort of broken caching behavior where it kept trying to send queries to the wrong primary.

It's at least possible that this sort of aborted failover happens a fair amount, but if there's no downtime then users just try again and it succeeds, so they never bother complaining to AWS. Unless AWS is specifically monitoring for it, they might be blind to it happening.
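One way an application can avoid getting stuck on the wrong primary is to open a fresh connection on each retry instead of reusing a cached one. The sketch below assumes a caller-supplied `connect` factory that resolves the writer endpoint again on every call. `ReadOnlyError` and `write_with_reconnect` are hypothetical names, not part of any real driver.

```python
import time


class ReadOnlyError(Exception):
    """Hypothetical error for a write that landed on a non-writer instance."""


def write_with_reconnect(connect, statement, attempts=3, delay=0.0):
    """Execute `statement`, opening a fresh connection for every attempt.

    `connect` is assumed to re-resolve the writer endpoint and return an
    object with an execute() method.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        conn = connect()  # do not reuse a connection pinned to the old writer
        try:
            return conn.execute(statement)
        except ReadOnlyError as exc:
            last_error = exc
            time.sleep(delay)  # give the failover or its revert time to settle
    raise last_error
```

Whether this would have helped in the incident from the article is unknown. It only addresses the case where the application, not the cluster, holds on to a stale target.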

benmmurphy 216 days ago

it could be most people pause writes because it's going to create errors if you try and execute a write against an instance that refuses to accept any writes, and for some people those errors might not be recoverable. so they just have some option in their application that puts the application into maintenance mode where it will hard reject writes at the application layer.

nrhrjrjrjtntbt 216 days ago

P0 if it happens to everyone, right? Like the USE1 outage recently. If it is 0.001% of customers (enough to get a HN story) it may not be that high. Maybe this customer is on a migration or upgrade path under the hood. Or just on a bad unit in the rack.

dboreham 216 days ago

Although the article has an SEO-optimized vibe, I think it's reasonable to take it as true until refuted. My rule of thumb is that any rarely executed, very tricky operation (e.g. database writer fail over) is likely to not work because there are too many variables in play and way too few opportunities to find and fix bugs. So the overall story sounds very plausible to me. It has a feel of: it doesn't work under continuous heavy write load, in combination with some set of hardware performance parameters that plays badly with some arbitrary time out. Note that the system didn't actually fail. It just didn't process the fail over operation. It reverted to the original configuration and afaics preserved data.

biggoodwolf 216 days ago

I recall seeing this also happening in CosmosDB. Both auto and manual