You can get an 8TB HDD for $150 right now. For roughly 50 PB, that's 6250 drives, or about $1MM in drives, which doesn't sound that cost-prohibitive. Obviously that's not the whole cost, since you also need to pay for bandwidth, replication and other infrastructure like the host nodes, but it sounds like something that could even be hosted by a number of volunteers on r/datahoarder or r/homelab.
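A quick sanity check of the drive arithmetic, as a rough sketch. It assumes the archive is about 50 PB (the figure used elsewhere in this thread) and decimal units (1 PB = 1000 TB); the function names are made up for illustration.

```python
import math

ARCHIVE_TB = 50_000       # assumption: ~50 PB, decimal units
DRIVE_TB = 8              # drive size quoted in the comment
DRIVE_PRICE_USD = 150     # drive price quoted in the comment


def drives_needed(archive_tb, drive_tb):
    """Number of whole drives needed to hold archive_tb, with no redundancy."""
    return math.ceil(archive_tb / drive_tb)


def raw_drive_cost(archive_tb, drive_tb, price_usd):
    """Purchase cost of the bare drives only (no hosts, power or bandwidth)."""
    return drives_needed(archive_tb, drive_tb) * price_usd


print(drives_needed(ARCHIVE_TB, DRIVE_TB))
print(raw_drive_cost(ARCHIVE_TB, DRIVE_TB, DRIVE_PRICE_USD))
```

Any replication factor multiplies both numbers, so a 3x-replicated copy would be closer to $3MM in bare drives.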
I also remember reading about Sia on HN, which is a dapp that pays hosts to store data and distributes it among them. Looking at the going rates on Sia ($1.45/TB/mo), 50 PB comes to $870k/yr. That's ~10% of the IA budget (which is only $10MM/yr, which sounds very efficient!) but it shows that the order of magnitude is not that crazy.
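The Sia figure can be reproduced the same way. This is only a sketch: it assumes 50 PB = 50,000 TB and takes the $1.45/TB/mo rate and the $10MM/yr budget from the comment as given; the function name is hypothetical.

```python
ARCHIVE_TB = 50_000            # assumption: ~50 PB, decimal units
SIA_USD_PER_TB_MONTH = 1.45    # rate quoted in the comment
IA_BUDGET_USD = 10_000_000     # yearly budget quoted in the comment


def yearly_storage_cost(archive_tb, usd_per_tb_month):
    """Flat storage cost per year at a per-TB monthly rate."""
    return archive_tb * usd_per_tb_month * 12


sia_yearly = yearly_storage_cost(ARCHIVE_TB, SIA_USD_PER_TB_MONTH)
budget_share = sia_yearly / IA_BUDGET_USD

print(f"${sia_yearly:,.0f} per year, {budget_share:.1%} of the budget")
```

That comes out a little under the "~10%" in the comment, which is close enough for an order-of-magnitude argument.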
I think at this scale we're no longer talking about buying drives, but securing a steady supply stream of them. So we're talking not $1MM, but a somewhat safe $1MM per year for it to be even worth considering.
Of course, there can be I/O and network charges, and different levels of redundancy (but possibly bulk discounts)... but the bare storage costs for 50 PB would be roughly $600k - $3MM per year.
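To compare that range with per-TB prices quoted elsewhere in the thread, it helps to convert it back to a monthly rate. A small sketch, assuming 50 PB = 50,000 TB; the helper name is made up.

```python
ARCHIVE_TB = 50_000  # assumption: ~50 PB, decimal units


def implied_usd_per_tb_month(yearly_cost_usd, archive_tb):
    """Per-TB monthly rate implied by a yearly bill for archive_tb."""
    return yearly_cost_usd / (archive_tb * 12)


low = implied_usd_per_tb_month(600_000, ARCHIVE_TB)
high = implied_usd_per_tb_month(3_000_000, ARCHIVE_TB)

print(f"${low:.2f} to ${high:.2f} per TB per month")
```

So the quoted range corresponds to about $1 to $5 per TB per month, which brackets the Sia rate mentioned above.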
The cloud business is a small fraction of Amazon's revenue but a large part of their profits. It's extremely profitable for them. That's why there is such a large discrepancy between (non bulk) HDD price and (non bulk) per month cost for archival.
Aren't the costs of getting that data out of the backup much larger than the cost of keeping it in the first place, to the point that when you actually need to restore a large backup, it turns out it was better to have been managing it yourself? That's the impression I got from various HN comments on the topic over the years.
Low cost to insert, low cost to keep it there, high cost to retrieve is exactly the combination you want when looking at disaster backup solutions, since you don't intend to retrieve the data frequently. Buy some earthquake insurance (I know, easier said than done) and only pay for 1/20 of the retrieval cost.
AWS has Snowball and Snowmobile, though I've only used the former, to reduce data transfer costs. I don't remember what other savings there are, such as whether there's a price reduction if you use it with Glacier.
Glacier Deep Archive does charge for retrievals: $0.02 per GB plus an additional $0.01 per 1000 retrieval requests (both of which are $0.00 for Standard S3). PUT, LIST and DELETE are at $0.05 per 1000 requests, 10x the Standard S3 rates.
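For a sense of scale, here is a rough sketch of what a full restore would cost at the retrieval rates quoted above. The rates are taken from the comment, not checked against current AWS pricing; the object count is a made-up placeholder, and data transfer out of AWS is not included.

```python
USD_PER_GB_RETRIEVED = 0.02       # rate quoted in the comment
USD_PER_1000_REQUESTS = 0.01      # rate quoted in the comment


def retrieval_cost(gb, requests,
                   usd_per_gb=USD_PER_GB_RETRIEVED,
                   usd_per_1000=USD_PER_1000_REQUESTS):
    """Retrieval fees only: per-GB charge plus per-request charge."""
    return gb * usd_per_gb + (requests / 1000) * usd_per_1000


archive_gb = 50_000_000           # assumption: ~50 PB, decimal units
object_count = 1_000_000          # hypothetical number of retrieval requests

print(f"${retrieval_cost(archive_gb, object_count):,.0f}")
```

At these rates the per-GB charge dominates: pulling everything back once costs about as much as buying the bare drives.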
Would it remain feasible as they scale up? Their content is growing faster and faster so the number of drives would have to rise every single month, probably by several dozen even.
If I'm reading that correctly, it would only cost a bit over 500 bucks a month to host that whole archive on BackBlaze B2.
Furthermore it would not be so hard to translate Archive.org items to IPFS objects, if there were an effort to pin a significant number of them to storage and network.
Yeah, I noticed that being a problem. My biggest problems with IPFS are hardly mentioned in their updates, and it's hard to tell if they have any interest.
I had constant issues with objects simply never being found (after hours and many requests), despite being pinned in several places, and the daemon sucked resources away from the system at an alarming rate back when I was seriously trying it.
With all that being said, maybe now is the time to look at it properly; the budgets are there. Maybe Juan, _prometheus, can find somebody to at least PoC this important application.
IABAK stores 100 TB, or 0.2% of it.