by ZeroCool2u
2341 days ago
They do mention in the postmortem that they explicitly do not provide backups, and that they say so on their product page, though they concede it could perhaps be better communicated to customers. Designing a system that is really robust to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2]. Comparing these guys to AWS/S3 is perhaps not entirely fair, given that they presumably have very different levels of resources. For a medium-sized shop and the constraints they've defined, I think this is a fair outcome of the situation. I do agree, though, that it could have been mitigated by deciding to actually store backups.
[1] https://www.wired.com/2012/07/google-colossus/
[2] https://cloud.google.com/files/storage_architecture_and_chal...
|
While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (it's almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with hundreds of customers each (and again, triply redundant disks), then you could have designed around the eventual failure of a node using what you already have. I'm not expecting them to hit S3's eleven 9s of durability. I'm just glancing at what they have said about their design and proposing that basic changes to how they allocate what they already have would have avoided the problem in the first place.
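To make the allocation point concrete, here is a toy sketch. It assumes "triply redundant" means three copies of each customer's data, which the postmortem may or may not mean, and every name in it (`place_replicas`, `customers_lost`, the node and customer labels) is made up for illustration. It is not how Ceph actually places data; it only compares keeping all three copies inside one node with spreading the same three copies across nodes.

```python
def place_replicas(customers, nodes, copies, spread):
    """Map each customer to the list of nodes holding its copies (hypothetical layout)."""
    if spread and copies > len(nodes):
        raise ValueError("need at least as many nodes as copies to spread them")
    placement = {}
    for i, customer in enumerate(customers):
        if spread:
            # Each copy lands on a different node, starting at offset i.
            placement[customer] = [nodes[(i + k) % len(nodes)] for k in range(copies)]
        else:
            # All copies sit on disks inside the same node.
            placement[customer] = [nodes[i % len(nodes)]] * copies
    return placement


def customers_lost(placement, failed_node):
    """Customers whose every copy lived on the failed node."""
    return sorted(
        c for c, homes in placement.items() if all(h == failed_node for h in homes)
    )


customers = [f"cust{i}" for i in range(6)]
nodes = ["node-a", "node-b", "node-c"]

same_node = place_replicas(customers, nodes, copies=3, spread=False)
across = place_replicas(customers, nodes, copies=3, spread=True)

print(customers_lost(same_node, "node-a"))  # ['cust0', 'cust3']
print(customers_lost(across, "node-a"))     # []
```

Both layouts store three copies per customer, so the raw capacity is the same. The only difference is whether a single node failure can take out every copy at once.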
I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.
Clearly, if they spent days trying to recover the customer data, that data was important enough to the business to justify the effort, so maybe it's time to re-evaluate 'no backups'.