by britmob 2222 days ago
There is http://iabak.archiveteam.org, but it’s not exactly large.

The IA is about 50 PB.

IABAK stores 100 TB, or 0.2% of it.

You can get an 8TB HDD for $150 right now. For 50 PB, that's 6250 drives, or about $1MM in drives, which doesn't sound that cost-prohibitive. Obviously that's not the whole cost, since you need to pay for bandwidth, replication, and other infrastructure like the host node, but it sounds like something that could even be hosted by a number of volunteers on r/datahoarder or r/homelab.
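
For anyone who wants to check the drive math, here is a rough sketch. It assumes decimal units (1 PB = 1000 TB) and a single unreplicated copy; the function names are made up for illustration.

```python
import math

# Figures quoted in the comment above; decimal units assumed (1 PB = 1000 TB).
ARCHIVE_TB = 50_000       # 50 PB
DRIVE_TB = 8              # capacity of one drive
DRIVE_PRICE_USD = 150     # price of one drive


def drives_needed(archive_tb, drive_tb):
    """Number of whole drives needed to hold one copy of the archive."""
    return math.ceil(archive_tb / drive_tb)


def drive_cost_usd(archive_tb, drive_tb, price_usd):
    """Raw purchase cost of those drives, with no replication or spares."""
    return drives_needed(archive_tb, drive_tb) * price_usd


count = drives_needed(ARCHIVE_TB, DRIVE_TB)
cost = drive_cost_usd(ARCHIVE_TB, DRIVE_TB, DRIVE_PRICE_USD)
print(f"{count} drives, ${cost:,}")
```

Any replication factor multiplies both numbers directly, so two full copies would be roughly double.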

I also remember reading about Sia on HN, which is a dapp that pays hosts to store data and distributes it. Looking at the going rates on Sia ($1.45/TB/mo), that's $870k/yr for the full 50 PB. That's a bit under 10% of the IA budget (which is only $10MM/yr, which sounds very efficient!) but shows that the order of magnitude is not that crazy.

I think at this scale we're no longer talking about buying drives, but securing a steady supply stream of them. So we're talking not $1MM, but a somewhat safe $1MM per year for it to be even worth considering.
BackBlaze B2 is $5/TB/month

Azure Archive is $2/TB/month ($1.68 if reserved)

AWS Glacier Deep Archive is $1/TB/month

GCP Cloud Storage Archive is $1.20/TB/month

Of course, there can be I/O and network charges, and different levels of redundancy (but possibly bulk discounts), but the bare storage costs for 50 PB would be roughly $600k to $3MM per year.
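
A quick sketch of how that range falls out of the list prices above. It assumes 50 PB = 50,000 TB and counts storage only (no requests, retrieval, or egress); the dictionary and function names are just for illustration.

```python
# Per-TB monthly list prices as quoted in this comment (USD).
PRICE_PER_TB_MONTH = {
    "BackBlaze B2": 5.00,
    "Azure Archive": 2.00,
    "Azure Archive (reserved)": 1.68,
    "AWS Glacier Deep Archive": 1.00,
    "GCP Cloud Storage Archive": 1.20,
}

ARCHIVE_TB = 50_000  # 50 PB, decimal units assumed


def annual_cost_usd(price_per_tb_month, capacity_tb=ARCHIVE_TB):
    """Bare storage cost for one year, ignoring every other charge."""
    return price_per_tb_month * capacity_tb * 12


# Print providers from cheapest to most expensive.
for name, price in sorted(PRICE_PER_TB_MONTH.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} ${annual_cost_usd(price):>12,.0f}/yr")
```

The cheapest and most expensive rows are the two ends of the range.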

The cloud business is a small fraction of Amazon's revenue but a large part of their profits. It's extremely profitable for them. That's why there is such a large discrepancy between (non bulk) HDD price and (non bulk) per month cost for archival.
Aren't the costs of getting that data out of the backup much larger than the cost of keeping it in the first place, to the point that when you actually need to restore a large backup, it turns out it was better to have been managing it yourself? That's the impression I got from various HN comments on the topic over the years.
Low cost to insert, low cost to keep it there, high cost to retrieve is exactly the combination you want when looking at disaster backup solutions, since you don't intend to retrieve the data frequently. Buy some earthquake insurance (I know, easier said than done) and only pay for 1/20 of the retrieval cost.
AWS has Snowball and Snowmobile, though I've only used the former, to reduce data transfer costs. I don't remember what other savings there are, e.g. whether there is a price reduction if you use it with Glacier.
Isn't that inbound only? Getting the data out again is also required.
The cost is only there if you transfer out of AWS. Something like Glacier will have a retrieval time on the order of hours or days.
Glacier Deep Archive does charge for retrievals at $0.02 per GB and additional $0.01 per 1000 such requests (both of which are $0.00 for Standard S3). PUT, LIST, DELETE are at $0.05 per 1000 requests, 10x the Standard S3 rates.

https://aws.amazon.com/s3/pricing/
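
To put the retrieval fee next to the storage fee, here is a rough sketch using only the rates quoted in this thread ($0.02/GB retrieval, $0.01 per 1000 retrieval requests, $1/TB/month storage). The object count is a made-up placeholder, and internet egress out of AWS is not included, so treat the result as a lower bound.

```python
# Rates as quoted in the thread; actual pricing varies by region and over time.
RETRIEVAL_USD_PER_GB = 0.02
REQUEST_USD_PER_1000 = 0.01
STORAGE_USD_PER_TB_MONTH = 1.00


def restore_cost_usd(size_gb, n_objects):
    """One-off fee to retrieve size_gb spread across n_objects requests."""
    data_fee = size_gb * RETRIEVAL_USD_PER_GB
    request_fee = (n_objects / 1000) * REQUEST_USD_PER_1000
    return data_fee + request_fee


def yearly_storage_usd(size_tb):
    """Storage-only fee for keeping size_tb for twelve months."""
    return size_tb * STORAGE_USD_PER_TB_MONTH * 12


ARCHIVE_GB = 50_000_000               # 50 PB, decimal units assumed
HYPOTHETICAL_OBJECTS = 1_000_000_000  # placeholder, not a real IA figure

restore = restore_cost_usd(ARCHIVE_GB, HYPOTHETICAL_OBJECTS)
storage = yearly_storage_usd(ARCHIVE_GB / 1000)
print(f"full restore: ${restore:,.0f}")
print(f"one year of storage: ${storage:,.0f}")
print(f"restore is about {restore / storage:.1f}x a year of storage")
```

Under these assumptions the per-GB fee dominates and the request fee is close to a rounding error.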

The majority of cloud costs of storage are bandwidth, so ignoring this makes the analysis meaningless.
Would it remain feasible as they scale up? Their content is growing faster and faster so the number of drives would have to rise every single month, probably by several dozen even.
> That's about $1MM

Is that a "million million", i.e. 10^12?

Some industries use M for thousand. For example, advertising uses CPM (cost per mille), which is the cost for 1000 clicks or views.

In this case, 6250*150 = 937500, almost a million.

It definitely stalled out. It needs a Windows version to get real traction IMO.
If I'm reading that correctly, it would only cost a bit over 500 bucks a month to host that whole archive on BackBlaze B2.

Furthermore it would not be so hard to translate Archive.org items to IPFS objects, if there were an effort to pin a significant number of them to storage and network.

Since numbers are not 100% clear...

(50 petabytes * 0.2% = 100 terabytes)

[$0.005 ($/GB/Month) BackBlaze cost]

[(50 petabytes) / (1 gigabyte) = 50,000,000]

(50,000,000 GB * $0.005 = $250,000)

—————

Meaning, based on my numbers, it would cost $250,000 a month to host 50 petabytes of data on BackBlaze.
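
The same arithmetic as a small sketch, covering both the 100 TB that IABAK holds and the full 50 PB. It assumes the $0.005/GB/month B2 rate quoted above and decimal units; the names are illustrative only.

```python
B2_USD_PER_GB_MONTH = 0.005  # rate quoted above
GB_PER_TB = 1_000            # decimal units assumed
GB_PER_PB = 1_000_000


def monthly_cost_usd(size_gb, usd_per_gb_month=B2_USD_PER_GB_MONTH):
    """Storage-only monthly cost; bandwidth and API calls are not counted."""
    return size_gb * usd_per_gb_month


iabak_gb = 100 * GB_PER_TB    # what IABAK currently stores
archive_gb = 50 * GB_PER_PB   # the whole Internet Archive

print(f"IABAK share of the archive: {iabak_gb / archive_gb:.1%}")
print(f"IABAK subset: ${monthly_cost_usd(iabak_gb):,.0f}/month")
print(f"full archive: ${monthly_cost_usd(archive_gb):,.0f}/month")
```

The factor of 500 between the two results is just the inverse of the 0.2% share.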

That's a fraction of the AWS bills for many startups arguably doing absolutely nothing
They have VC money to burn. Archive.org doesn't.

Also, those backups would be (relatively) cheap to keep, but not necessarily to restore.

I would guess restore wouldn't be a problem. AWS or whoever would do it for free given it is a non-profit (in case of a disaster only, of course).
The effort is the issue here. There was this comment back when IA.BAK was in the design phase: https://news.ycombinator.com/item?id=9148576 And then this is all there was to show for it: https://www.archiveteam.org/index.php?title=INTERNETARCHIVE....
Yeah, I noticed that being a problem. My biggest problems with IPFS are hardly mentioned in their updates, and it's hard to tell if they have any interest.

I had constant issues with objects simply never being found (after hours and many requests), despite being pinned in several places; and the daemon sucked resources away from the system at an alarming rate back when I was giving it a proper try.

With all that being said, maybe now is the time to look at it properly; the budgets are there. Maybe Juan, _prometheus, can find somebody to at least PoC this important application.

You're reading it correctly, but IABAK backs up 0.2% of the Internet Archive.