by marcinzm 1887 days ago
Like another comment said, don't bother swapping out disks; just leave the dead ones in place and disable them in software. Then eventually either replace the whole server or get someone on site to do a mass swap of disks. At this scale redundancy needs to be spread between machines anyway, so there is no gain in replacing disks as they die.

That also means that you need extra spare disks in the system, which also means extra servers, extra racks, extra power feeds, extra cooling etc.

If you do a 60-disk 4U setup you'll need 1 full rack of those just to get your 10PB, then you'll need yet another one for redundancy. And then a quarter for hot spares. At that point you have single-redundancy, no file history and no scaling. Is it possible? Sure. Is this something you can do 'on a side track with the people you already have'? Unlikely if you are a startup with no datacenter, no colocation yet etc.
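A rough sketch of the rack arithmetic behind that estimate. The 60-disk 4U chassis and the 10PB target come from the comment; the 42U rack height and the 18TB drive size are assumptions (the thread does not state either), and the numbers ignore filesystem overhead, switches, and power limits.

```python
import math

RACK_UNITS = 42          # assumption: standard full-height rack
CHASSIS_UNITS = 4        # from the comment: 4U chassis
DISKS_PER_CHASSIS = 60   # from the comment: 60 disks per chassis
DISK_TB = 18             # assumption: drive size is not given in the thread


def raw_tb_per_rack(disk_tb=DISK_TB):
    """Raw (unformatted) capacity of one rack filled with 4U/60-disk chassis."""
    chassis_per_rack = RACK_UNITS // CHASSIS_UNITS
    return chassis_per_rack * DISKS_PER_CHASSIS * disk_tb


def racks_needed(target_pb, copies=1, disk_tb=DISK_TB):
    """Fractional number of racks needed to hold `copies` full copies of the data."""
    required_tb = target_pb * 1000 * copies  # decimal units: 1 PB = 1000 TB
    return required_tb / raw_tb_per_rack(disk_tb)


if __name__ == "__main__":
    for copies in (1, 2):
        racks = racks_needed(10, copies=copies)
        print(f"{copies} copy/copies of 10 PB: {racks:.2f} racks "
              f"(round up to {math.ceil(racks)})")
```

Under these assumptions one rack holds about 10.8PB raw, so a single copy of 10PB just fits in one rack and a second full copy needs a second rack, before counting hot spares.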

You don't do redundancy that way at that scale; that's completely insane. You run Ceph or BeeGFS or Windows Storage Server and back up to tape with a tape library. If you've got big bucks (though still peanuts compared to S3), you replicate the entire setup 1:1 at a second site.
The author doesn't want a second site. And at that scale you do redundancy within the requested parameters.

If you set your object store to be resilient to single-partition loss per object (within CAP) you effectively duplicate everything once. If you want more-than-one you get into sharding to spread the risk. We're not talking about RAID here, but about replicas or copies.

Windows Storage Server doesn't belong in a setup like this, and neither does tape, since the data needs to be accessible in under 1s. If higher latencies were fine the author would have been able to use something between S3 IA and Glacier. Heck, you could use cold HDD storage for that kind of access. The drives would need to spin up to collect the shards to assemble at least one replica to be able to read the file, but that's still multiple orders of magnitude faster than tape.

I have written a larger post with more numbers, and unless you seriously reduce the features you use, it's not really cheaper than S3 if you start off with no physical IT and no people to support it. It's not that it isn't possible, it's just that you need to spin up an entire business unit for it and at that point you're eating way more cost.

Regardless of the object store (or filesystem if you want to go full on legacy style), you still need at least the minimum amount of physical bits on disk to be able to store the data. And pretty much no object store supports a 1:1 logical-physical storage scale. It's almost always at least 1:1.66 in degraded mode or 1:2 in minimum operational mode.

> We're not talking about RAID here, but about replicas or copies.

Most distributed filesystems support some form of erasure coding. Ceph does, Minio does, HDFS does, etc. So no, you don't need to duplicate everything.
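To put numbers on the replication vs. erasure coding point, here is a minimal sketch of the raw-to-logical storage ratio for each approach. The k+m layouts below are illustrative assumptions, not settings taken from Ceph, MinIO, or HDFS; a k+m code stores k data shards plus m parity shards and survives the loss of any m shards.

```python
from fractions import Fraction


def replication_overhead(copies):
    """Raw bytes stored per logical byte with N full copies."""
    return Fraction(copies, 1)


def erasure_overhead(k, m):
    """Raw bytes stored per logical byte with k data shards and m parity shards."""
    return Fraction(k + m, k)


def raw_pb(logical_pb, overhead):
    """Raw capacity needed to hold `logical_pb` of data at the given overhead."""
    return logical_pb * overhead


if __name__ == "__main__":
    # Hypothetical layouts chosen only for illustration.
    schemes = {
        "2x replication": replication_overhead(2),
        "EC 3+2": erasure_overhead(3, 2),
        "EC 8+3": erasure_overhead(8, 3),
    }
    for name, overhead in schemes.items():
        print(f"{name}: 1:{float(overhead):.3f}, "
              f"10 PB logical needs {float(raw_pb(10, overhead)):.2f} PB raw")
```

A 3+2 layout works out to 5/3, which happens to match the 1:1.66 figure mentioned upthread, while wider layouts such as 8+3 get well below 1:2 and still tolerate multiple shard losses.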

You're talking about data integrity, this is not the same as redundancy.
> You're talking about data integrity, this is not the same as redundancy.

To be clear, you're talking about mitigating the risk of data corruption (e.g. bits flipping randomly due to cosmic rays or what have you) over time, vs. the risk of outright data loss, yes?

Isn't there some overlap between the solutions?
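One way to picture the distinction and the overlap, as a toy sketch rather than how any particular object store does it: a checksum (integrity) can detect a flipped bit but cannot fix it, while a second copy (redundancy) supplies the data to repair from. The function names here are hypothetical.

```python
import hashlib
from typing import Sequence


def checksum(data: bytes) -> str:
    """Integrity: a digest that changes if any bit of the data changes."""
    return hashlib.sha256(data).hexdigest()


def read_with_repair(replicas: Sequence[bytes], expected: str) -> bytes:
    """Redundancy: return the first replica whose digest still matches."""
    for copy in replicas:
        if checksum(copy) == expected:
            return copy
    raise IOError("no intact replica left")


original = b"object payload"
expected = checksum(original)

# Simulate bit rot by flipping the lowest bit of the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

if __name__ == "__main__":
    print("corruption detected:", checksum(corrupted) != expected)
    print("repaired from replica:",
          read_with_repair([corrupted, original], expected) == original)
```

With only the checksum you know the object is bad; with only the replicas you cannot tell which copy is the bad one. That is where the two concerns meet.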