by escardin
2341 days ago
There was one obvious thing they could have done: backups. ZFS even makes it easy. They state they have triple redundancy on their servers, so they could pull half of the drives and use them for backups. ZFS supports streaming snapshots (though maybe not on this particularly old system). It sounds like they have multiple ZFS servers per datacenter, so given two servers, they could use 50% of the storage on each node as a backup of the other node. The backups don't have to be available to customers; they just need to increase availability and decrease MTTR.

In this situation, even with only a daily snapshot, they could have had customers up and running on yesterday's data while they took their time recovering the old system, rather than moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?

I feel like they read something about how S3 keeps at least three copies of everything and then did that locally with ZFS, without accounting for all the other failure modes that the S3 design covers. You are right that there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at a scale, and have the hardware available, to make better choices than they have.
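To make the cross-node backup idea concrete, here is a rough sketch of what a daily snapshot-and-stream job could look like. It only builds the `zfs snapshot`, `zfs send`, and `zfs receive` command lines and does not run them. The dataset names, the peer hostname, and the snapshot naming scheme are all made up for illustration; I have no idea what their actual pool layout is, and an old ZFS release may not support every option.

```python
import shlex
from typing import List, Optional


def snapshot_cmd(dataset: str, snap: str) -> List[str]:
    """Command that takes a point-in-time snapshot of `dataset`."""
    return ["zfs", "snapshot", f"{dataset}@{snap}"]


def replicate_pipeline(
    dataset: str,
    snap: str,
    peer_host: str,
    backup_dataset: str,
    prev_snap: Optional[str] = None,
) -> str:
    """Shell pipeline that streams a snapshot to the peer node.

    With `prev_snap` set, only the changes since that snapshot are sent
    (an incremental stream); otherwise the full snapshot is sent.
    """
    send = ["zfs", "send"]
    if prev_snap is not None:
        send += ["-i", f"{dataset}@{prev_snap}"]
    send.append(f"{dataset}@{snap}")

    # -F rolls the backup dataset back to its latest snapshot before
    # receiving, discarding any local changes made on the backup side.
    receive = ["ssh", peer_host, "zfs", "receive", "-F", backup_dataset]

    return " | ".join(shlex.join(part) for part in (send, receive))


if __name__ == "__main__":
    # Hypothetical names: node A backs up its customer data onto node B.
    print(shlex.join(snapshot_cmd("tank/customers", "2024-01-02")))
    print(
        replicate_pipeline(
            dataset="tank/customers",
            snap="2024-01-02",
            peer_host="node-b",
            backup_dataset="backup/node-a/customers",
            prev_snap="2024-01-01",
        )
    )
```

Run the same job in the other direction on node B and each node holds yesterday's copy of its peer, which is the whole point: restore from the peer first, then debug the broken box at leisure.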
Intangibles. Whenever you talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature not to consider that something could go wrong, let alone to make decisions based on that possibility.