Hacker News new | ask | show | jobs
by escardin 2341 days ago
There was one, obvious thing they could have done. Backups. ZFS even makes it easy. They state they have triple redundancy on their servers. Pull 1/2 of the drives and have backups. ZFS supports streaming snapshots (maybe not on this particularly old system). It sounds like they have multiple ZFS servers per datacenter, so given 2 servers, they could use 50% of the storage on each nodes as a backup of the other node.

It's not like the backups have to be customer available, use them to increase availability and decrease MTTR. In this situation, even with a daily snapshot they could have had customers up and running with yesterday's data while they took their time recovering the old system and not moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?

I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.

You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.

2 comments

> How much did five days of panic cost them? Their customers? Their brand?

Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.

Oh, I think they recognise the damage to their brand.
> Oh, I think they recognise the damage to their brand.

Today, yes. Two years ago?

They do mention in the postmortem that they explicitly do not provide backups, and say so on their product page, but perhaps that it could be better communicated to customers.

Designing a really robust system to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2].

Comparing these guys to AWS/S3 is perhaps not entirely fair given the assumption that they have very different levels of resources. For a medium size shop and the constraints they've defined, I think this is a fair outcome of the situation. I agree though in that it could have been mitigated by making the decision to actually store backups.

[1]https://www.wired.com/2012/07/google-colossus/

[2]https://cloud.google.com/files/storage_architecture_and_chal...

Do not provide backups and don't have backups aren't the same thing. If you have triply redundant local disks, you can probably afford take half of them and use them as backups for other systems and achieve better availability results (I'm assuming it's not triply redundant for performance).

While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with 100's of customers each (and again, triply redundant disks), then you could have built around the eventual failure of a node with what you already have. I'm not expecting them to hit S3's 11 9 availability, just taking a glance at what they have said about their design and proposing that basic changes to how they allocate what they already have would have avoided their problem in the first place.

I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.

Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.