by Smrchy 4806 days ago
It would be really interesting to know the average size of an object and visualize the number of hard disks it takes to store all this data.
3 comments

To put this into perspective it's 2.7B new objects per day (assuming 1 Trillion objects averaged over 365 days).

Assuming each object is 100KB (generous estimate, after compression) that would be 270GB per day -- or assuming ten levels of redundancy and striped across three RAID storage devices (per level of redundancy) then 8.1TB per day.

I'm not familiar with their hard disk procurement policies but it wouldn't be difficult to assume they've been purchasing 1TB drives, so 10 new disk drives per day just for keeping ahead of growth. Furthermore let's assume their disk drive failure churn rate is 10% per day so another 1 new disk drive for parts replacement (so 11 disk drives per day).

These are really loose numbers not based on any actual data (or any personal experience at all) but just napkin math, so take it all with a grain of salt.

I'm not convinced that 100KB is a great estimate on file size, but either way you're off by a few zeroes. It's not 270GB per day, it's 270TB. Even if each object were just one byte, that would be 2.7GB. 100KB is one hundred thousand bytes. So it's quite a bit more than eleven drives per day!
You are correct, that would be 270TB.
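
For anyone who wants to check the units, here is a quick sketch of the same arithmetic. The inputs are just the guesses from this thread (1 trillion objects spread over 365 days, 100KB per object, decimal units), not real S3 figures, and the helper name `to_tb` is made up for illustration.

```python
# Napkin-math check. All inputs are assumptions from the thread, not S3 data.
TOTAL_OBJECTS = 1_000_000_000_000  # "1 trillion objects"
DAYS = 365
OBJECT_SIZE_BYTES = 100_000        # 100KB, using decimal units


def to_tb(num_bytes):
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return num_bytes / 1e12


objects_per_day = TOTAL_OBJECTS / DAYS
bytes_per_day = objects_per_day * OBJECT_SIZE_BYTES

print(f"{objects_per_day / 1e9:.2f}B objects per day")
print(f"{to_tb(bytes_per_day):.0f} TB per day at 100KB per object")
# At one byte per object, bytes per day equals objects per day.
print(f"{objects_per_day / 1e9:.2f} GB per day at 1 byte per object")
```

Without rounding the object count down to 2.7B first, the daily volume comes out a little above 270TB, which is well within napkin tolerance.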

After applying the same shoddy math with each object being 100KB -- 270TB with 10 levels of redundancy across 3 RAID drives resulting in 8,100TB per day. This would be 8,100 drives (at 1TB per drive), or 8,910 drives after 10% being dead-on-arrival.

The math is sketchy, so let's cut it down by 10x (10KB per object): 891 drives per day. Keep in mind this is just for S3 and it doesn't account for existing drives failing, growth, or what other services require (e.g. EC2, RDS, CloudWatch, CloudFront, etc).
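
The drive counts above can be reproduced with a few lines. This is only a sketch under the thread's assumptions (2.7B objects per day, 10 levels of redundancy, 3 RAID devices per level, 1TB drives, 10% dead on arrival); the function name `drives_per_day` and its parameters are hypothetical, and integer arithmetic is used so rounding doesn't nudge the result.

```python
def drives_per_day(object_size_bytes,
                   objects_per_day=2_700_000_000,
                   redundancy_levels=10,
                   devices_per_level=3,
                   drive_capacity_bytes=10**12,
                   doa_percent=10):
    """Estimate new drives needed per day under the thread's assumptions."""
    raw_bytes = objects_per_day * object_size_bytes
    stored_bytes = raw_bytes * redundancy_levels * devices_per_level
    # Ceiling division: a partially filled drive is still a whole drive.
    drives = -(-stored_bytes // drive_capacity_bytes)
    # Pad the order for drives that arrive dead, rounding up again.
    return -(-drives * (100 + doa_percent) // 100)


for size in (100_000, 10_000):
    print(f"{size // 1000}KB per object: {drives_per_day(size)} drives per day")
```

Changing any one input (say 4TB drives, or 3 copies instead of 30) shifts the answer far more than the rounding does, which is why the numbers should be read as orders of magnitude only.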

Average size is irrelevant. A handful of 1GB objects dwarf hundreds of 1-byte objects when computing the average. The overall distribution is what's interesting. There are actually three: GET sizes, PUT sizes, and stored sizes. They are not identical distributions, especially since, as the PUT size distribution has changed, it has drifted out of sync with the stored size distribution. Wish I could tell you more; there are some fascinating data points in there but, you know, NDA. Source: former S3 employee.
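
To illustrate the point about averages with a toy example: the sample below is entirely made up (300 one-byte objects and three 1GB objects) and says nothing about the real S3 distributions.

```python
import statistics

# Hypothetical sample: 300 objects of 1 byte and 3 objects of 1GB (decimal).
sizes = [1] * 300 + [1_000_000_000] * 3

mean_size = statistics.mean(sizes)
median_size = statistics.median(sizes)
share_in_large = sum(s for s in sizes if s >= 1_000_000_000) / sum(sizes)

print(f"mean:   {mean_size / 1e6:.1f} MB")
print(f"median: {median_size} byte(s)")
print(f"share of bytes in the 1GB objects: {share_in_large:.4%}")
```

The mean lands in the megabytes even though 99% of the objects are a single byte, so a single "average object size" describes almost none of the objects in the sample.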
You could then multiply by 2 trillion to work out their total data volume under management, which I believe they consider commercially sensitive information, but I can't quite put my finger on why.
It would help competitors form a view of their cost structure, which would allow them to optimise a price which would put Amazon into the red.