
Where three different disks (or whatever storage sits underneath the Ceph OSDs), each in a separate disk silo, fail?

In some earlier Ceph clusters I was responsible for, I set replication to four or even five.

There was a point where I wrote an internal memo to business leaders explaining that, had I not set replication to four or five, we would have been unable to meet business goals (we would have lost the contracts keeping us afloat during critical stages), but that now, at scale, the real dollar cost of that redundancy was showing. In that memo I explained what networking we needed vs. what we had, what people we needed vs. what we had, and so on. I eventually got the networking gear the company needed in place and was able to hire the kind of people we needed.

For the Ceph clusters I am chiefly responsible for today, we have Ceph pools in each availability zone doing erasure coding (four or more parity chunks), and a small service makes sure objects are copied between the geographically separated datacenters (availability zones).
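To put rough numbers on why the dollar cost of 4x or 5x replication shows up at scale, here is a minimal sketch comparing raw-storage overhead against an erasure-coded layout. The `k=8` data-chunk count is purely an assumption for illustration (the only figure given above is four or more parity chunks), and the function names are made up for this example.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> tuple[Fraction, int]:
    """Return (raw bytes stored per usable byte, copies that can be lost)."""
    return Fraction(copies), copies - 1


def erasure_overhead(k: int, m: int) -> tuple[Fraction, int]:
    """Return (raw bytes stored per usable byte, chunks that can be lost).

    k is the number of data chunks, m the number of parity chunks.
    """
    return Fraction(k + m, k), m


# k=8 is a hypothetical choice; only m >= 4 comes from the comment above.
layouts = {
    "replication x4": replication_overhead(4),
    "replication x5": replication_overhead(5),
    "erasure k=8 m=4": erasure_overhead(8, 4),
}

for label, (overhead, tolerated) in layouts.items():
    print(f"{label}: {float(overhead):.2f}x raw storage, "
          f"survives {tolerated} simultaneous failures")
```

Under those assumed numbers the erasure-coded pool tolerates more simultaneous failures than 4x replication while storing well under half the raw bytes, which is the trade that makes sense once the networking and people are in place to run it.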

While I do get a bit of a boner (of the Hank Hill on propane kinda vibe) over how the erasure coding is distributed across cabinet rows, to be fair it is a bit of an optimization.
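For anyone wondering what spreading across cabinet rows buys you, here is a toy sketch. It is not CRUSH and not how Ceph actually places data; it just round-robins chunks over hypothetical row names and checks that losing any one whole row costs no more chunks than the parity count `m`. The `k=8, m=4` layout and the row names are assumptions.

```python
from collections import Counter


def spread_chunks(total_chunks: int, rows: list[str]) -> list[str]:
    """Assign chunk i to rows[i % len(rows)] (simple round-robin)."""
    return [rows[i % len(rows)] for i in range(total_chunks)]


def survives_row_loss(placement: list[str], m: int) -> bool:
    """True if no single row holds more than m chunks of the object."""
    return max(Counter(placement).values()) <= m


k, m = 8, 4  # hypothetical layout: 8 data chunks, 4 parity chunks

# Four rows: 3 chunks per row, so a whole row can vanish.
four_rows = spread_chunks(k + m, ["row-a", "row-b", "row-c", "row-d"])
print(survives_row_loss(four_rows, m))

# Two rows: 6 chunks per row, so losing a row exceeds m.
two_rows = spread_chunks(k + m, ["row-a", "row-b"])
print(survives_row_loss(two_rows, m))
```

The point is only that the number of failure domains has to be large enough relative to `k + m` for a row outage to stay within what parity can rebuild.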

> to have lower durability (say replicas are located within the same networking pod)

A network segment should never ever go down. I know in some places this is "optimistic". I can say for us it's not "optimistic" because we don't allow the Ethernet protocol to "hop" anywhere. Layer-2 broadcast domains either stop at the top-of-rack switch, with everything beyond that being layer-3, or we are using InfiniBand. Stable Ethernet networks are a thing, but not using Ethernet beyond where Ethernet belongs avoids so much risk.

100% network uptime is achievable.

In some places, more critical nodes are connected to three InfiniBand switches, with two uninterruptible power supplies dedicated to each InfiniBand switch. Maybe that's "excessive", but within the last year we had to contend with a switch failure. Then a week later a datacenter provider (one that will tell you they never ever EVER lose power) for one of our availability zones was unable to provide power for over two full days.

I don't think we're ever going to have downtime or lose data. In other places this would be "optimistic", but for this deployment I have access to the resources I need to achieve the thing. The only thing that has me concerned about downtime is the current WW3 we are already in heating up.