This article seems to indicate that manually triggered failovers will always fail if your application tries to maintain its normal write traffic during that process. Not that I'm discounting the author's experience, but something doesn't quite add up:

- How is it possible that other users of Aurora aren't experiencing this issue basically all the time? How could AWS not know it exists?
- If they know, how is this not an urgent P0 issue for AWS? This seems like the most basic of basic usability features is 100% broken.
- Is there something more nuanced to the failure case here, such as a dependence on in-progress transactions? I can see how the failover might wait for in-flight transactions to close, hit a timeout, and then proceed with the other part of the failover by accident. That could explain why the issue doesn't seem to be more widespread.

If it's anything like how Azure handles this kind of issue, it's likely "lots of people have experienced it, a restart fixes it so no one cares that much, few have any idea how to figure out a root cause on their own, and the process to find a root cause with the vendor is so painful that no one ever sees it through."