by ZeroCool2u
2341 days ago
This is basically the stuff of nightmares. You can't really fault them for the ZFS version being so old that the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just some random hardware failure that can't be anticipated. Just bad luck. Beyond radically changing how their core infrastructure works, it doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the postmortem though; at least they've been fairly honest and direct about it.
|
It's not like the backups have to be available to customers; use them to increase availability and decrease MTTR. In this situation, even with only a daily snapshot, they could have had customers up and running on yesterday's data while they took their time recovering the old system, instead of moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?
I feel like they read something about how S3 keeps at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failure modes that the S3 design handles.
You are right that there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at a scale, and have the hardware available, to make better choices than they have.