by _1tem 462 days ago
I have a feeling that economies of scale have a point of diminishing returns. At what point does it become more costly and complicated to store your data on S3 versus just maintaining a server with RAID disks somewhere?

S3 is an engineering marvel, but it's an insanely complicated backend architecture just to store some files.

6 comments

That's going to depend a lot on what your needs are, particularly in terms of redundancy and durability. S3 takes care of a lot of that for you.

One server with a RAID array can usually survive 1 or maybe 2 drive failures. The remaining drives in the array have to do more work when a failed drive is replaced and data is copied to the new array member. This sometimes leads to additional failures before the rebuild completes, because the drives in the array are probably all the same model, bought at the same time, and thus have similar manufacturing quality and materials. This is part of why it's generally said that RAID != backup.
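To put a rough number on the rebuild risk, here is a minimal sketch. It assumes independent failures at a constant annual failure rate, which is exactly what same-batch drives under rebuild load violate, so treat the result as an optimistic lower bound. The function name and the inputs (8 drives, 2% AFR, 24-hour rebuild) are made up for illustration.

```python
HOURS_PER_YEAR = 8760.0


def p_second_failure(n_drives: int, afr: float, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent failures at a constant annual failure rate (afr).
    Correlated failures (same model, same batch, extra rebuild load) would
    push the real number higher.
    """
    survivors = n_drives - 1
    # Probability that one drive makes it through the rebuild window.
    p_survive_window = (1.0 - afr) ** (rebuild_hours / HOURS_PER_YEAR)
    # All survivors must make it through for the rebuild to be uneventful.
    return 1.0 - p_survive_window ** survivors


if __name__ == "__main__":
    # Hypothetical inputs: 8-drive array, 2% AFR, 24-hour rebuild.
    print(f"{p_second_failure(8, 0.02, 24):.6f}")
```

The per-rebuild number looks small under these assumptions; the point of the comment above is that the independence assumption is the weak part.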

You can make a local backup to something like another server with its own storage, external drives, or tape storage. Capacity, recovery time, and cost vary a lot across the available options here. Now you're protected against the original server failing, but you're not protected against location-based impacts - power/network outages, weather damage/flooding, fire, etc.

You can make a remote backup. That can be in a location you own / control, or you can pay someone else to use their storage.

Each layer of redundancy adds cost and complexity.

AWS says S3 is designed for 99.999999999% durability and 99.99% availability. You can absolutely design your own system that meets those thresholds, but that is far beyond what one server with a RAID array can do.
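For a sense of what those two numbers mean in practice, here is a quick back-of-the-envelope sketch. It reads durability as an annual per-object loss probability and availability as a fraction of the year; the function names and the 10 million object count are just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_objects_lost_per_year(n_objects: int, durability_nines: int) -> float:
    """Expected objects lost per year if each object is lost with
    probability 10**-durability_nines in a given year."""
    return n_objects * 10.0 ** -durability_nines


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year a service may be unavailable at this availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical workload: 10 million objects at 11 nines of durability.
    lost = expected_objects_lost_per_year(10_000_000, 11)
    print(f"expected objects lost per year: {lost:.6f}")
    print(f"years per lost object: {1 / lost:,.0f}")
    print(f"downtime at 99.99%: {downtime_minutes_per_year(0.9999):.2f} min/year")
```

Note how different the two figures are: the availability number allows most of an hour of downtime a year, while the durability number makes losing any given object extremely unlikely.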

How many businesses or applications really need 99.999999999% durability and 99.99% availability? Is your whole stack organized to deliver the aforementioned durability and availability?
I think that this is, to Andy's point, basically about simplicity. It's not that your business necessarily needs 11 9s of durability for continuity purposes, but it sure is nice that you never have to think about the durability of the storage layer (vs. even something like EBS where 5 9s of durability isn't quite enough to go from "improbable" to "impossible").
There are a lot of companies whose livelihood depends on their proprietary data, and loss of that data would be a company-ending event. I'm not sure how the calculus works out exactly, but having additional backups and types of backups to reduce risk is probably one of the smaller business expenses one can pick up. Sending a couple TB of data to three+ cloud providers on top of your physical backups is in the tens of dollars per month.
Different people and organizations will have different needs, as indicated in the first sentence of my post. For some use cases one server is totally fine, but it's good to think through your use cases and understand how loss of availability or loss of data would impact you, and how much you're willing to pay to avoid that.

I'll note that data durability is a bit of a different concern than service availability. A service being down for some amount of time sucks, but it'll probably come back up at some point and life moves on. If data is lost completely, it's just gone. It's going to have to be re-created from other sources, generated fresh, or accepted as irreplaceable and lost forever.

Some use cases can tolerate losing some or all of the data. Many can't, so data durability tends to be a concern for non-trivial use cases.

> One server with a RAID array can survive, usually, 1 or maybe 2 drive failures.

Standard RAID configurations can only handle 2 failures, but there are libraries and filesystems allowing arbitrarily high redundancy.

As long as it's all in one server, there are still a lot of situations that can immediately cut through all that redundancy.

As long as it's all in one physical location, there's still fire and weather as ways to immediately cut through all that redundancy.

Probably never. The complexity is borne by Amazon. Even before any of the development begins, if you want a RAID setup with some sort of decent availability, you've already multiplied your server costs by the number of replicas you'd need. It's a Sisyphean task that also has little value for most people.

Much like Twitter, it's conceptually simple, but it's a hard problem to solve at any scale beyond a toy.

One interesting thing about S3 is the vast scale of it. E.g. if you need to store 3 PB of data you might need 150 HDDs + redundancy, but if you store it on S3 it's chopped up and put on tens of thousands of HDDs, which helps with IOPS and throughput. Of course that's shared with others, which is why smart placement is key, so that hot objects are spread out.

Some details in https://www.allthingsdistributed.com/2023/07/building-and-op... / https://www.youtube.com/watch?v=sc3J4McebHE
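A rough sketch of why spreading helps. Here IOPS counts operations per second and throughput counts bytes per second; the two are tied together by request size. The 3 PB and 150 drive figures come from the comment above (which implies 20 TB drives); the per-drive IOPS figure and the 20,000-drive fleet are assumed round numbers, not measurements.

```python
import math


def drives_needed(data_tb: float, drive_tb: float) -> int:
    """Drives required to hold the data, ignoring redundancy."""
    return math.ceil(data_tb / drive_tb)


def aggregate_iops(n_drives: int, iops_per_drive: float) -> float:
    """Best-case random IOPS if load is spread evenly over all drives."""
    return n_drives * iops_per_drive


def throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Throughput implied by an IOPS figure at a given request size."""
    return iops * io_size_kb / 1024


if __name__ == "__main__":
    own = drives_needed(3000, 20)  # 3 PB on 20 TB drives
    iops_per_drive = 120           # assumed figure for a spinning disk
    fleet = 20_000                 # assumed stand-in for "tens of thousands"
    print(f"own drives: {own}")
    print(f"own array IOPS: {aggregate_iops(own, iops_per_drive):,.0f}")
    print(f"shared fleet IOPS: {aggregate_iops(fleet, iops_per_drive):,.0f}")
    small = throughput_mb_s(aggregate_iops(own, iops_per_drive), 4)
    print(f"own array at 4 KB requests: {small:.1f} MB/s")
```

The fleet number is shared with every other customer, which is why the placement point above matters: the ceiling is only useful if hot objects do not pile up on the same drives.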

What's the difference between "IOPS" and "throughput"?
There are a few stories from companies that moved away from S3, like Dropbox, and shared their investments and expenses.

The long and short of it is that to get anywhere near the redundancy, reliability, performance, etc. of S3, you're spending A Lot of money.

There is a diminishing return on the percentage you save, sure. But Amazon will always be at that edge. They have already amortized the equipment, labour, administration, electricity, storage, cooling, etc.

They also already have support for storage tiering, replication, encryption, ACLs, integration with other services (from web access to sending notifications of storage events to Lambda, SQS, etc). You get all of this whether you're saving 1 eight-bit file or trillions of gigabyte-sized ones.

There are reasons why you may need to roll your own storage setup (regulatory, geographic, some other unique reason), but you'll never be more economical than S3, especially if the storage is mostly sitting idle.

> At what point does it become more costly and complicated to store your data on S3 versus just maintaining a server with RAID disks somewhere?

It's more costly immediately. S3 storage prices are above what you would pay even for triply redundant media, and you have to pay a very high rate for data transfer out to the public internet.

It's far less complicated though. You just create a bucket and you're off to the races. Since the S3 API endpoints are all public there's not even a delay for spinning up the infrastructure.

Where S3 shines for me is two things. The first is automatic lifecycle management: objects can be moved between storage classes based on the age of the object and even automatically deleted after expiration. The second is S3 events, which are also _durable_ and make S3 into an actual appliance instead of just a convenient key/value store.
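For anyone who hasn't set one up, a lifecycle rule looks roughly like this with boto3. This is a minimal sketch: the rule ID, the day counts, and the choice of storage classes are placeholders, and `apply_lifecycle` needs boto3 installed plus real credentials and a bucket you own.

```python
def lifecycle_rules(ia_after_days: int = 30,
                    glacier_after_days: int = 90,
                    expire_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that ages objects through cheaper
    storage classes and then deletes them. All day counts are placeholders."""
    return {
        "Rules": [
            {
                "ID": "age-out-objects",   # hypothetical rule name
                "Filter": {"Prefix": ""},  # empty prefix = every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_lifecycle(bucket: str) -> None:
    """Attach the rules to a bucket. Not run here; requires AWS access."""
    import boto3  # third-party; imported lazily so the sketch runs without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=lifecycle_rules(),
    )


if __name__ == "__main__":
    print(lifecycle_rules())
```

Once the rule is attached, the transitions and expiration happen on S3's side with no cron job or worker of your own.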

Care to elaborate? What you’re saying doesn’t match my experience.

I’ve paid pennies a year to store data in s3 for the better part of 5 years. Can’t even buy a hdd with what I’ve spent on s3.

The per GB price on S3 is higher than on bulk HDDs. This is easily observed. What you are saying is your data storage needs don't even justify a single HDD. This is a scaling issue and not a pricing issue.
Oh, so “it’s more costly immediately” actually meant “it’s more costly once you’re storing over some threshold of data.” Ok. I can get behind that.

I don’t think it’s so easily observed at scale though, because at that point it’s hardly just the hdd cost anymore. It’s the hdd, server/compute, cabling, cooling, power, facilities, security, maintenance.

The TCO of data storage isn't just the drive - it just so happens that it’s still less than the cost of a drive up to some threshold. I don’t know of anyone having done a full cost model comparison. Everything I’ve ever seen assumes the data center is free.
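The threshold argument in this subthread can be sketched as a toy break-even model. Every number below is a placeholder, not a quote: plug in your own S3 rate, drive price, replica count, and whatever you think servers, power, space, and staff cost per month. It also ignores transfer and request charges.

```python
from typing import Optional


def s3_monthly_cost(tb: float, s3_per_tb_month: float) -> float:
    """Storage-only S3 bill; ignores transfer and request charges."""
    return tb * s3_per_tb_month


def self_hosted_monthly_cost(tb: float, fixed_per_month: float,
                             hw_per_tb_month: float, replicas: int) -> float:
    """Fixed overhead (servers, power, space, staff) plus amortized drives."""
    return fixed_per_month + tb * replicas * hw_per_tb_month


def break_even_tb(fixed_per_month: float, s3_per_tb_month: float,
                  hw_per_tb_month: float, replicas: int) -> Optional[float]:
    """Data size above which self-hosting is cheaper, or None if never."""
    margin = s3_per_tb_month - replicas * hw_per_tb_month
    if margin <= 0:
        return None
    return fixed_per_month / margin


if __name__ == "__main__":
    # All inputs are assumptions: $23/TB-month for S3, a $15/TB drive
    # amortized over 60 months, 3 replicas, $500/month of fixed overhead.
    tb = break_even_tb(500.0, 23.0, 15.0 / 60, 3)
    print(f"break-even: {tb:.1f} TB" if tb is not None else "never breaks even")
```

With fixed overhead set to zero the break-even is 0 TB, which is the "it's more costly immediately" position; the disagreement above is really about how large that fixed term is.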