Hacker News new | ask | show | jobs
by dotwaffle 34 days ago
GCS / Azure can survive a total AZ failure too, think of 3 x Replicas as RAID-1, whereas Erasure Coding is more like RAID-6. Only it's actually more resilient:

Let's say you have 10 data blocks, and you have 4 parity blocks. You can now lose 4 servers containing a block and still be able to repair the data, whereas in 3 x Replica you can only lose 2, and have to store everything 300% of size, instead of only 140%.

And yes, it is unreasonable how much they charge for both storage and inter-az bandwidth.

1 comments

The problem with erasure coding is that if a disk fails, you suddenly need to read 3-5x more data to reconstruct the missing data from parity blocks. This is especially problematic if your replicas are split across zones. The inter-zonal bandwidth is large, but not infinite.

So I'm pretty sure that you need to have at least 2 full copies in different AZs, and then likely at least some additional redundancy within a single AZ (in the form of erasure codes or a full mirror).

So that's at least 3-4x the amount of data. 1Tb of NVMe SSD capacity is around $200 and with 3x redundancy that's $600, or about 2 years of AWS S3 storage. As I said, it's expensive but not unreasonable.

That's what Locally Recoverable Codes (including Pyramid Codes) are designed to address: you can repair a missing block using only a subset of the blocks, which you can make sure are placed in just one zone, eliminating the cross-az bandwidth requirements. Sure, if you lose multiple blocks, you are going to end up needing that extra bandwidth, but the chances of having two blocks offline at once is very low -- if it's just for a failure rather than extended maintenance, then you have probably already repaired the first block by the time the second fails.

In fact, the RS configuration is often in the >50 data blocks and >10 parity blocks range (albeit with an LRC/nested RS config) for object stores because it's more important to have that recoverability than repair efficiency. While one large provider I worked at did have a system whereby they did effectively have two copies of the RS-encoded data (so that 130% turned into 260%) across two AZs, they were actively in the process of swapping to the blocks being evenly distributed across the AZs, near-halving the total required disk space.

As I said before, most object storage is not on SSD, it's on hard disk: it's 20% of the price per TB, and most objects are read very infrequently. I can promise you that they're not paying $200 for 1TB of SSD either... I realise prices are higher than sensible in the last 6 months, but it was fairly easy to pick up SSDs for under $50/TB at retail pricing (and hard disks for under $10/TB) only a year ago.