| HN Mirror


by ZeroCool2u 2341 days ago

This is basically the stuff of nightmares.

You can't really fault them for the ZFS version being so old that the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just a random hardware failure that couldn't have been anticipated.

Just bad luck. Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the post mortem though, at least they've been fairly honest and direct about it.

2 comments

escardin 2341 days ago

There was one obvious thing they could have done: backups. ZFS even makes it easy. They state they have triple redundancy on their servers. Pull half of the drives and use them for backups. ZFS supports streaming snapshots (maybe not on this particularly old system). It sounds like they have multiple ZFS servers per datacenter, so given 2 servers, they could use 50% of the storage on each node as a backup of the other node.
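As a rough sanity check of the "50% of each node backs up the other" idea, here is a minimal Python sketch. The `Node` type, the capacities, and the `reserve_fraction` parameter are all hypothetical illustrations, not anything taken from the postmortem:

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity_tb: float  # usable pool capacity (assumed figure)
    used_tb: float      # primary customer data held on this node


def can_back_up_each_other(a: Node, b: Node, reserve_fraction: float = 0.5) -> bool:
    """Return True if each node can keep its own data in the unreserved
    share of its pool AND hold a full copy of its peer's data in the
    reserved share."""
    for local, peer in ((a, b), (b, a)):
        reserved = local.capacity_tb * reserve_fraction
        primary_space = local.capacity_tb - reserved
        if local.used_tb > primary_space:
            return False  # own data no longer fits once space is reserved
        if peer.used_tb > reserved:
            return False  # peer's data does not fit in the reserved share
    return True


if __name__ == "__main__":
    a = Node("zfs-a", capacity_tb=100, used_tb=40)
    b = Node("zfs-b", capacity_tb=100, used_tb=45)
    print(can_back_up_each_other(a, b))
```

The point of the check is only that the scheme works when each node is no more than half full; past that, the reserved share has to shrink or more hardware is needed.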

It's not like the backups have to be customer-facing; use them to increase availability and decrease MTTR. In this situation, even with a daily snapshot they could have had customers up and running with yesterday's data while they took their time recovering the old system, instead of moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?

I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.

You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.


generalpass 2341 days ago

> How much did five days of panic cost them? Their customers? Their brand?

Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.


RantyDave 2341 days ago

Oh, I think they recognise the damage to their brand.


generalpass 2341 days ago

> Oh, I think they recognise the damage to their brand.

Today, yes. Two years ago?


ZeroCool2u 2341 days ago

They do mention in the postmortem that they explicitly do not provide backups, and that they say so on their product page, though they concede it could perhaps be better communicated to customers.

Designing a system that is really robust to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2].

Comparing these guys to AWS/S3 is perhaps not entirely fair, given that they presumably have very different levels of resources. For a medium-sized shop and the constraints they've defined, I think this is a fair outcome of the situation. I do agree, though, that it could have been mitigated by deciding to actually store backups.

[1]https://www.wired.com/2012/07/google-colossus/

[2]https://cloud.google.com/files/storage_architecture_and_chal...


escardin 2341 days ago

"Do not provide backups" and "don't have backups" aren't the same thing. If you have triply redundant local disks, you can probably afford to take half of them and use them as backups for other systems and achieve better availability results (I'm assuming it's not triply redundant for performance).

While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with hundreds of customers each (and again, triply redundant disks), then you could have built around the eventual failure of a node with what you already have. I'm not expecting them to hit S3's eleven nines of durability, just taking a glance at what they have said about their design and proposing that basic changes to how they allocate what they already have would have avoided their problem in the first place.

I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.

Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.


rsync 2341 days ago

"Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this."

If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...

If only ...
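For anyone unfamiliar with the pattern being hinted at, here is a minimal Python sketch that only builds (and does not run) a `zfs send` piped over SSH into `zfs receive`. The dataset, snapshot, host, and target names are hypothetical, and whether a given provider accepts `zfs receive` over SSH is an assumption to verify with that provider:

```python
import shlex
from typing import Optional


def build_replication_pipeline(
    dataset: str,
    new_snap: str,
    remote: str,
    target: str,
    prev_snap: Optional[str] = None,
) -> str:
    """Return a shell pipeline string that streams a snapshot to a remote host.

    With prev_snap set, an incremental stream (zfs send -i) is produced;
    otherwise the full snapshot is sent.
    """
    send = ["zfs", "send"]
    if prev_snap is not None:
        send += ["-i", f"{dataset}@{prev_snap}"]
    send.append(f"{dataset}@{new_snap}")

    # The receiving side runs 'zfs receive' under SSH (assumed to be permitted).
    recv = ["ssh", remote, "zfs", "receive", target]

    return f"{shlex.join(send)} | {shlex.join(recv)}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_replication_pipeline(
        dataset="tank/customers",
        new_snap="daily-2",
        remote="backup@backup.example.com",
        target="backup/customers",
        prev_snap="daily-1",
    ))
```

The snapshot itself would be created beforehand with `zfs snapshot tank/customers@daily-2`; incremental sends keep the daily transfer proportional to what changed rather than to the size of the pool.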
