by TrogdorTheMan 1891 days ago
We managed to get a reply from a C-level exec. All we could get out of them was "something to do with our DB, but we don't know the root yet. Our fail-over process didn't work. This will never happen again".

Also, it only took them two and a half hours to admit it was their entire system instead of "a small subset of users" lol.

2 comments

"something to do with our DB"

Oof. Is there something about what they do that prevents you from having a completely separate second site? Or is this a case where "bad data" is being happily propagated to the redundant site?

As core as the service is, I would have imagined a panic button that reverts the database for site #2 to some specified point in time.
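
For what it's worth, that kind of panic button is roughly point-in-time recovery. Here is a rough sketch of just the planning step, assuming the second site keeps periodic snapshots plus a timestamped change log. The function name, the data shapes, and the timestamps are all made up for illustration; nothing here reflects the vendor's actual setup.

```python
from bisect import bisect_right
from datetime import datetime


def plan_point_in_time_restore(snapshot_times, log_entries, target):
    """Pick the newest snapshot at or before `target`, plus the log
    entries that must be replayed on top of it to reach `target`.

    snapshot_times: iterable of datetime (hypothetical snapshot schedule)
    log_entries:    iterable of (datetime, str) change records (hypothetical)
    """
    snapshots = sorted(snapshot_times)
    # Index of the first snapshot strictly after the target time.
    idx = bisect_right(snapshots, target)
    if idx == 0:
        raise ValueError("no snapshot at or before the target time")
    base = snapshots[idx - 1]
    # Replay only changes made after the snapshot and up to the target.
    ordered = sorted(log_entries, key=lambda entry: entry[0])
    replay = [(ts, op) for ts, op in ordered if base < ts <= target]
    return base, replay


# Made-up example: snapshots every 6 hours, bad data written at 09:00.
snapshots = [
    datetime(2021, 1, 1, 0),
    datetime(2021, 1, 1, 6),
    datetime(2021, 1, 1, 12),
]
log = [
    (datetime(2021, 1, 1, 5), "insert A"),
    (datetime(2021, 1, 1, 7), "insert B"),
    (datetime(2021, 1, 1, 9), "bad update"),
]

# Revert to 08:00, i.e. just before the bad write.
base, replay = plan_point_in_time_restore(snapshots, log, datetime(2021, 1, 1, 8))
print(base, replay)
```

The catch is the one raised above: this only helps if someone can name a target time before the bad data arrived, and if site #2 retains snapshots and logs from before that time.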

From what I've seen on some internal email chains, there is a fail-over process for HA multi-region/site, but it didn't work right. Whomp whomp.
Wasn't their previous major outage because of a bad migration?
I don't think so. I think it was a combo of malicious intent and some indexes that never got run. I guess you might call it a bad migration since the indexes didn't get run, but that seems more like a catalyst than a root cause. https://cdn.auth0.com/blog/20181128-Incident-RCA.pdf