by palijer 1529 days ago
I think this document and incident are a decent example of common DR planning failure patterns.

The document explains that Atlassian runs regular DR planning meetings where engineers spend time planning out potential scenarios, and that it tests backups quarterly and tracks the findings from those tests.

So, with those two things happening, I imagine the recovery time objective of <6 hours took a typical "we deleted data from a bad script run affecting a lot of customers" scenario into account, using the metrics from the quarterly backup tests.

That doesn't come close to the recovery time we are currently seeing, however. We're coming up on two orders of magnitude more than that.
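For a sense of scale, here is a rough back-of-the-envelope sketch of what two orders of magnitude over a 6-hour objective works out to. The 6-hour figure is the upper bound from the doc; the function and variable names are just illustrative, and no actual outage duration is assumed here.

```python
import math

# Upper bound of the stated recovery time objective, in hours.
RTO_HOURS = 6


def orders_of_magnitude_over(actual_hours, target_hours=RTO_HOURS):
    """Return log10 of how many times the actual recovery exceeds the target."""
    return math.log10(actual_hours / target_hours)


# Recovery time that would be exactly two orders of magnitude over the RTO.
hours_at_two_orders = RTO_HOURS * 10 ** 2
print(hours_at_two_orders, "hours =", hours_at_two_orders / 24, "days")
```

So "two orders of magnitude" past a 6-hour target means recovery measured in weeks rather than hours.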

The above doc seems pretty far out of line with what is currently happening.