Hacker News
by sethhochberg 1885 days ago
It's also important to consider how often disks will fail when you are operating hundreds of them. It's probably more often than you'd think, and if you don't have someone on staff who is near your colo provider, you're going to pay a lot in remote hands fees.

Your colo facility will almost certainly have 24/7 staff on hand who can help you with tasks like swapping disks from a pile of spares, but expect to pay $300+ minimum just to get someone to walk over to your racks, even if the job is 10 mins.

With that said, the cost savings can still be enormous. But know what you're getting into.

3 comments

Like another comment said, don't bother swapping out disks; just leave the dead ones in place and disable them in software. Then eventually either replace the whole server or get someone on site to do a mass swap of disks. At this scale redundancy needs to be spread between machines anyway, so there is no gain in replacing disks as they die.
That also means that you need extra spare disks in the system, which also means extra servers, extra racks, extra power feeds, extra cooling etc.

If you do a 60-disk 4U setup you'll need 1 full rack of those just to get your 10PB, then you'll need yet another one for redundancy. And then a quarter for hot spares. At that point you have single-redundancy, no file history and no scaling. Is it possible? Sure. Is this something you can do 'on a side track with the people you already have'? Unlikely if you are a startup with no datacenter, no colocation yet etc.
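To make the rack arithmetic above concrete, here is a rough sketch. The 18 TB disk size and the 42U rack height are my assumptions (the comment states neither), the function name is made up for illustration, and it ignores rack space for switches and PDUs.

```python
import math

def racks_needed(raw_pb, disk_tb, disks_per_chassis=60, chassis_u=4, rack_u=42):
    """Return (disks, chassis, racks) needed to hold raw_pb of raw capacity.

    disk_tb and rack_u are assumed values, not figures from the thread.
    """
    disks = math.ceil(raw_pb * 1000 / disk_tb)        # 1 PB = 1000 TB
    chassis = math.ceil(disks / disks_per_chassis)    # 60-disk enclosures
    chassis_per_rack = rack_u // chassis_u            # whole 4U units per rack
    racks = math.ceil(chassis / chassis_per_rack)
    return disks, chassis, racks

disks, chassis, racks = racks_needed(10, 18)
print(disks, chassis, racks)  # 556 10 1

# One full copy for redundancy plus a quarter rack of hot spares,
# as described in the comment above.
total_racks = racks * 2 + 0.25 * racks
print(total_racks)  # 2.25
```

With smaller disks the primary copy alone no longer fits in one rack, so the "1 full rack" figure depends heavily on the disk size you assume.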

You don't do redundancy that way at that scale, that's completely insane. You run ceph or beegfs or Windows Storage Server and back up to tape with a tape library. If you've got big bucks (though still peanuts compared to s3) you replicate the entire setup 1:1 at a second site.
The author doesn't want a second site. And at that scale you do redundancy within the requested parameters.

If you set your object store to be resilient to single-partition loss per object (within CAP) you effectively duplicate everything once. If you want more-than-one you get into sharding to spread the risk. We're not talking about RAID here, but about replicas or copies.

Windows Storage Server doesn't belong in a setup like this, and neither does tape since it needs to be accessible in under 1s. If higher latencies were fine the author would have been able to use something between S3 IA and Glacier. Heck, you could use cold HDD storage for that kind of access. The drives would need to spin up to collect the shards to assemble at least one replica to be able to read the file, but that's still multiple orders of magnitude faster than tape.

I have written a larger post with more numbers, and unless you seriously reduce the features you use, it's not really cheaper than S3 if you start off with no physical IT and no people to support it. It's not that it isn't possible, it's just that you need to spin up an entire business unit for it and at that point you're eating way more cost.

Regardless of the object store (or filesystem if you want to go full on legacy style), you still need at least the minimum amount of physical bits on disk to be able to store the data. And pretty much no object store supports a 1:1 logical-physical storage scale. It's almost always at least 1:1.66 in degraded mode or 1:2 in minimum operational mode.

>We're not talking about RAID here, but about replicas or copies.

Most distributed filesystems support some form of erasure coding. Ceph does, Minio does, HDFS does, etc. So no, you don't need to duplicate everything.
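For anyone who wants to compare the two approaches, the raw-to-logical ratio is simple arithmetic. A minimal sketch follows; the shard layouts are illustrative examples I picked, not the defaults of Ceph, Minio or HDFS, and the function names are hypothetical.

```python
def raw_per_logical(data_shards, parity_shards):
    """Raw bytes stored per logical byte for a k+m erasure-coded layout."""
    return (data_shards + parity_shards) / data_shards

def raw_needed_pb(logical_pb, data_shards, parity_shards):
    """Raw capacity in PB needed to hold logical_pb of data."""
    return logical_pb * raw_per_logical(data_shards, parity_shards)

# Plain replication is the special case of 1 data shard plus (copies - 1) extra copies.
print(raw_per_logical(1, 1))            # 2.0   (two full copies)
print(raw_per_logical(4, 2))            # 1.5   (survives loss of any 2 shards)
print(round(raw_per_logical(3, 2), 2))  # 1.67
print(raw_needed_pb(10, 4, 2))          # 15.0  PB raw for 10 PB logical
```

So under these assumed layouts, erasure coding tolerates two lost shards for 1.5x raw storage, where two full copies cost 2x and tolerate only one loss.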

You're talking about data integrity, this is not the same as redundancy.
I currently pay about $40 for a half hour of remote hands at a large data center. Modern disks rarely need to be swapped. You can look at BackBlaze's published failure rates and do the math yourself if you don't believe me.
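Doing that math is a few lines. This is only a sketch: the 1.5% annualized failure rate and the 600-disk fleet are placeholders I made up, not figures quoted from BackBlaze, so plug in the published rate for your drive model. The two per-swap prices are the $40 mentioned here and the $300 minimum mentioned at the top of the thread.

```python
def expected_annual_failures(disk_count, afr):
    """Expected number of disk failures per year for a fleet.

    afr is the annualized failure rate as a fraction (0.015 means 1.5%).
    """
    return disk_count * afr

def annual_swap_cost(disk_count, afr, cost_per_swap):
    """Expected yearly remote hands spend if every failure is one paid visit."""
    return expected_annual_failures(disk_count, afr) * cost_per_swap

DISKS = 600   # assumed fleet size
AFR = 0.015   # assumed failure rate, placeholder only

print(round(expected_annual_failures(DISKS, AFR), 1))  # about 9 failures a year
print(round(annual_swap_cost(DISKS, AFR, 40)))         # at $40 per visit
print(round(annual_swap_cost(DISKS, AFR, 300)))        # at $300 per visit
```

Batching several swaps into one visit would cut the cost further, which is the point made elsewhere in the thread about leaving dead disks in place.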
I’ve used Netapps and Isilon in the past. We didn’t change any disks, they did as part of the maintenance. Not sure how the physical security worked but they were let in by the data centre staff and did their thing. I think they came in weekly.

The whole solution wasn’t cheap though, and all of these extras were baked into the cost. We were getting better than S3 costs on a straight per-TB basis, without considering power, cooling and rack space costs. Network was significantly cheaper than AWS.

Not sure how far these NAS systems scale, but I would expect deep discounts for something of this size.